Abstract 

Gene-regulatory enhancers have been identified using various approaches, including evolutionary conservation, regulatory 
protein binding, chromatin modifications, and DNA sequence motifs. To integrate these different approaches, we 
developed EnhancerFinder, a two-step method for distinguishing developmental enhancers from the genomic background 
and then predicting their tissue specificity. EnhancerFinder uses a multiple kernel learning approach to integrate DNA 
sequence motifs, evolutionary patterns, and diverse functional genomics datasets from a variety of cell types. In contrast 
with prediction approaches that define enhancers based on histone marks or p300 sites from a single cell line, we trained 
EnhancerFinder on hundreds of experimentally verified human developmental enhancers from the VISTA Enhancer Browser. 
We comprehensively evaluated EnhancerFinder using cross validation and found that our integrative method improves the 
identification of enhancers over approaches that consider a single type of data, such as sequence motifs, evolutionary 
conservation, or the binding of enhancer-associated proteins. We find that VISTA enhancers active in embryonic heart are 
easier to identify than enhancers active in several other embryonic tissues, likely due to their uniquely high GC content. We 
applied EnhancerFinder to the entire human genome and predicted 84,301 developmental enhancers and their tissue 
specificity. These predictions provide specific functional annotations for large amounts of human non-coding DNA, and are 
significantly enriched near genes with annotated roles in their predicted tissues and lead SNPs from genome-wide 
association studies. We demonstrate the utility of EnhancerFinder predictions through in vivo validation of novel embryonic 
gene regulatory enhancers from three developmental transcription factor loci. Our genome-wide developmental enhancer 
predictions are freely available as a UCSC Genome Browser track, which we hope will enable researchers to further 
investigate questions in developmental biology. 
Introduction 

Eukaryotic gene expression is regulated by a highly orchestrated 
network of events, including the binding of regulatory proteins to 
DNA, chemical modifications to DNA and nucleosomes, recruit- 
ment of the transcriptional machinery, splicing, and post- 
transcriptional modifications. Enhancers are genomic regions that 
influence the timing, amplitude, and tissue specificity of gene 
expression through the binding of transcription factors and co- 
factors that increase transcription (as reviewed in [1,2]). In 
humans, genetic variation in enhancer regions is implicated in a 
wide variety of developmental disorders, diseases, and adverse 
responses to treatments [3,4,5]. 

Enhancers have been discovered in introns, exons, intergenic 
regions megabases away from their target genes [6], and even on 
different chromosomes [7]. An enhancer frequently drives only 
one of many domains of a gene's expression [8,9] and different cell 
types accordingly exhibit considerable differences in their active 
enhancers [10,11]. This modularity enables the creation of 
complex regulatory programs that can evolve relatively easily 
between closely related species [12,13]. 

Individual enhancers were initially identified using transgenic 
assays in cultured cell lines [14,15] and later in vivo in model 
organisms, such as mouse, Drosophila, and zebrafish. In the in vivo 
experiments, a construct containing the sequence to be tested for 
enhancer activity, a minimal promoter, and a reporter gene (e.g., 
lacZ) is injected into fertilized eggs, and transgenic individuals are 
assayed for reporter gene expression. 

Early efforts to find enhancers at the genome scale used 
comparative genomics. Several studies assayed non-coding regions 
conserved across diverse species for enhancer activity [16,17,18], 
since functional non-coding regions likely evolve under negative 
selection. This approach identified many enhancers at a range of 
levels of evolutionary conservation [19,20,21]. However, relying 
on evolutionary conservation alone has several shortcomings: 
many characterized enhancers are not conserved between species 
[22], non-coding conservation is not specific to enhancer elements, 
and evolutionary patterns provide little information about the 
tissue and timing of enhancer activity. 

Author Summary 

The human genome contains an immense amount of 
non-protein-coding DNA with unknown function. Some of this 
DNA regulates when, where, and at what levels genes are 
active during development. Enhancers, one type of 
regulatory element, are short stretches of DNA that can 
act as "switches" to turn a gene on or off at specific times 
in specific cells or tissues. Understanding where in the 
genome enhancers are located can provide insight into 
the genetic basis of development and disease. Enhancers 
are hard to identify, but clues about their locations are 
found in different types of data including DNA sequence, 
evolutionary history, and where proteins bind to DNA. 
Here, we introduce a new tool, called EnhancerFinder, 
which combines these data to predict the location and 
activity of enhancers during embryonic development. We 
trained EnhancerFinder on a large set of functionally 
validated human enhancers, and it proved to be very 
accurate. We used EnhancerFinder to predict tens of 
thousands of enhancers in the human genome and 
validated several of the predictions near three important 
developmental genes in mouse or zebrafish. 
EnhancerFinder's predictions will be useful in understanding functional 
regions hidden in the vast amounts of human non-coding 
DNA. 

Enhancer prediction has been revolutionized by recent technological 
advances, including chromatin immunoprecipitation coupled 
with high-throughput sequencing (ChIP-seq) [23], RNA 
sequencing (RNA-seq), and sequencing of DNaseI-digested 
chromatin (DNase-seq) [24] or formaldehyde-assisted isolation of 
regulatory elements (FAIRE-seq) [25]. These "functional genomics" 
assays enable genome-wide measurement of histone 
modifications, binding sites of regulatory proteins, transcription 
levels, and the structural conformation of DNA. The ENCODE 
project [26], FANTOM project [27], and similar studies focused 
on specific cell types [28,29] have dramatically increased the 
amount of publicly available functional genomics data. 

Functional genomics studies revealed several genomic signatures 
of active enhancers. For example, known enhancers are associated 
with the unstable histone variants H3.3 and H2A.Z [30,31] and 
low nucleosome occupancy [32], although these chromatin states 
are not unique to enhancers. Monomethylation of lysine 4 on 
histone H3 (H3K4me1), a lack of trimethylation at the same site 
(H3K4me3), and acetylation of lysine 27 on histone H3 
(H3K27ac) may distinguish active enhancers from promoters 
[10,33,34], enhancers that are "poised" for activity later in 
development [35,36], and regulatory elements that repress gene 
expression [37,38]. Additional features that pinpoint specific 
classes of active enhancers include binding of the transcriptional 
cofactor p300/CBP [18,39,40,41], clusters of transcription factor 
(TF) binding sites [42,43,44,45], and enhancer RNA transcription 
(eRNAs) [46]. Collectively, functional genomics data have 
pinpointed the locations of many novel enhancers and yielded 
insights into sequence and structural determinants of enhancer 
activity. However, these patterns have not proven to be universal 
[47,48], and there is unlikely to be a single chromatin signature 
that identifies all classes of enhancers [11,49,50]. 

Given the complexity of these functional genomics data sets, 
computational methods have been developed to improve and 
generalize the enhancer predictions made from simple combina- 
tions of these data. Support vector machines (SVMs) and linear 
regression models trained to interpret DNA sequence motifs 
underlying known enhancers have successfully identified novel 
enhancers active in heart [51], hindbrain [52], and muscle [53] 
development. Another approach used SVMs to learn patterns of 
short DNA sequence motifs that distinguish markers of potential 
enhancers, such as p300 and H3K4me1, in different cellular 
contexts [54,55]. Random forests have been used to predict p300 
binding sites from histone modifications in human embryonic stem 
cells and lung fibroblasts [56]. Machine-learning algorithms have 
also been applied to the related problem of selecting functional TF 
binding sites out of the thousands of hits to a TF's binding motif 
throughout the genome [57,58,59,60,61,62,63]. Finally, two 
groups have taken a less supervised approach and used hidden 
Markov models (ChromHMM) [64] and dynamic Bayesian 
networks (Segway) [65] to segment the human genome into 
regions with unique signatures in ENCODE data and then 
assigned potential functions, such as enhancer activity, to these 
states. 

While rich datasets coupled with sophisticated algorithms have 
successfully identified many novel enhancers, comprehensive 
enhancer prediction is challenging for two main reasons. First, 
no single type of data is currently sufficient to identify all 
enhancers active in a given context. Many of the approaches 
described above use a single mark or motif as a proxy for an 
enhancer, but this gives an incomplete representation of all 
biologically active enhancers. Second, while a great deal of 
functional genomics data are available for different cell lines and 
tissues, it is not understood how informative experiments in a given 
cellular context are about enhancer activity in other 
contexts. 

With these issues in mind, we introduce EnhancerFinder, a new 
two-step machine-learning method for predicting enhancers and 
their tissue specificity. In machine learning, a classification 
algorithm is trained to distinguish between labeled training 
examples (e.g., enhancers and non-enhancers) based on features 
of these labeled examples (e.g., evolutionary conservation, 
chromatin signature, DNA sequence). The trained classifier can 
then be used to predict the labels for uncharacterized genomic 
regions (e.g., which ones are enhancers). Our approach employs 
two rounds of a supervised machine-learning technique called 
multiple kernel learning (MKL) [66,67]. MKL is based on the 
theory of SVMs [68], but provides greater flexibility to combine 
diverse data (e.g., evolutionary conservation, sequence motifs, and 
functional genomics data from different cellular contexts) and to 
interpret their relative contributions to the resulting predictions. 
Our implementation of EnhancerFinder applies MKL in two steps 
with the goal of generating a genome-wide set of developmental 
enhancers to better characterize gene regulation during develop- 
ment. The algorithm, which is trained using in vivo validated 
enhancers from the VISTA enhancer database [69] and publicly 
available genomic data, first aims to distinguish human develop- 
mental enhancers from the genomic background and then in a 
second step predicts enhancer tissue specificity. In contrast to most 
other enhancer prediction strategies, which are trained on 
epigenetic marks or sequence motifs that serve as a proxy for a 
subset of all active enhancers, our use of a heterogeneous and in 
vivo validated set of enhancers enables us to investigate the 
complex suite of features that underlie active regulatory regions. 
With appropriate training data, EnhancerFinder could be applied 
to study gene regulation at other developmental stages. 

Our analyses demonstrate that EnhancerFinder's integration of 
diverse types of data from different cellular contexts significantly 
improves prediction of validated enhancers over approaches based 
on a single context or type of data. We find that enhancers active 
in some developmental contexts are easier to identify than others. 
Applying EnhancerFinder to the entire human genome allowed us 
to predict more than 80,000 developmental enhancers, with tissue- 
specific predictions for brain, limb, and heart. These predictions 
significantly overlap known non-coding regulatory regions and are 
enriched near relevant genome-wide association study (GWAS) 
lead single nucleotide polymorphisms (SNPs) and genes expressed 
in the predicted tissue. To illustrate the utility and accuracy of our 
genome-wide enhancer predictions, we used them to investigate 
the enhancer landscape near three developmentally expressed 
genes. First, we screened predicted enhancers near FOXC1 and 
FOXC2 in transgenic zebrafish, and found that 70% (7 of 10) of 
tested EnhancerFinder predictions have confirmed (6) or sugges- 
tive (1) developmental enhancer activity. In addition, we validated 
a novel cranial nerve enhancer near the ZEB2 locus using a 
transgenic mouse enhancer assay. Taken together, our results 
suggest that the EnhancerFinder approach of integrating diverse 
data sets significantly improves prediction of biologically active 
enhancers, providing high-confidence candidate enhancers for 
studies in developmental gene regulation. 

Results 

We present EnhancerFinder, a machine learning-based en- 
hancer prediction pipeline that allows the seamless integration of 
feature data from a variety of experimental techniques and 
biological contexts that have previously been used individually to 
predict enhancers (Figure 1). We use MKL to integrate these data. 
MKL algorithms learn a weighted combination of different 
"kernel" functions that quantify the similarity of different feature 
data in order to make predictions. In EnhancerFinder, we use 
three kernels based on different types of biological feature data: 
DNA sequence motifs, evolutionary conservation patterns, and 
functional genomics datasets. 
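
The idea of a weighted combination of kernels can be illustrated with a minimal sketch. It is not the EnhancerFinder implementation: the toy feature values and the kernel weights below are hypothetical, and the weights are fixed by hand, whereas an MKL algorithm would learn them together with the classifier.

```python
def linear_kernel(features):
    """Gram matrix of dot products between the rows of `features`."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in features]
        for row_i in features
    ]


def combine_kernels(kernels, weights):
    """Weighted sum of equally sized Gram matrices (one per data type)."""
    if len(kernels) != len(weights):
        raise ValueError("need exactly one weight per kernel")
    if any(w < 0 for w in weights):
        raise ValueError("kernel weights must be non-negative")
    n = len(kernels[0])
    return [
        [sum(w * k[i][j] for k, w in zip(kernels, weights)) for j in range(n)]
        for i in range(n)
    ]


# Hypothetical toy features for three genomic regions, one table per data type.
motif_features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
conservation_features = [[0.9], [0.1], [0.8]]
functional_features = [[1.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 0.0, 0.0]]

kernels = [
    linear_kernel(motif_features),
    linear_kernel(conservation_features),
    linear_kernel(functional_features),
]
weights = [0.3, 0.3, 0.4]  # assumed values; MKL would learn these from data
combined = combine_kernels(kernels, weights)
```

The combined matrix can then be handed to any kernel classifier that accepts a precomputed similarity matrix, and the learned weights indicate how much each data type contributes.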

EnhancerFinder could be used to predict enhancers active at 
any stage and tissue. In this study, we evaluate EnhancerFinder's 
ability to predict developmental enhancers and their tissue 
specificity. 

A two-step approach to tissue-specific enhancer 
prediction 

Step 1 of our pipeline aims to distinguish all enhancers active in 
the context of interest (e.g., a specific developmental stage) from 
non-enhancer regions. Step 2 then builds classifiers to predict the 
tissues in which the enhancer candidates from Step 1 are active. 
This two-step approach allows us to accurately identify enhancers, 
while also distinguishing their tissues of activity. 

We train and evaluate EnhancerFinder using the VISTA 
Enhancer Browser, which at the time of our analysis contained 
over 700 human sequences with experimentally validated 
enhancer activity in at least one tissue at embryonic day 11.5 
(E11.5) in transgenic mouse embryos. VISTA also contained a 
similar number of regions without enhancer activity in this 
context. E11.5 in mouse development roughly corresponds to E41 
(Carnegie stage 17 [70]) in human development. In Step 1 of 
EnhancerFinder, we used all 711 VISTA enhancers as positive 
training data, and for negative training data, we created a set of 
711 random regions matched to the length and chromosome 
distribution of the positives to represent the genomic background. 
We did not use the VISTA negatives as negative training examples 
in Step 1, because they are not representative of all non-enhancer 
regions (see below). Our goal in Step 1 is to develop a method that 
can be used to scan the whole genome and distinguish 
developmental enhancer regions from non-enhancer regions. 
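
One way to draw background regions matched to the length and chromosome distribution of the positives is sketched below. This is an illustrative assumption rather than the procedure used in the study: the function name, the half-open coordinate convention, and the toy chromosome sizes are hypothetical, and the sketch does not exclude draws that happen to overlap known enhancers or assembly gaps.

```python
import random


def sample_matched_background(positives, chrom_sizes, seed=0):
    """Draw one random region per positive on the same chromosome with the
    same length. Regions are (chrom, start, end) with half-open coordinates."""
    rng = random.Random(seed)
    background = []
    for chrom, start, end in positives:
        length = end - start
        max_start = chrom_sizes[chrom] - length
        if max_start < 0:
            raise ValueError(f"region longer than chromosome {chrom}")
        new_start = rng.randint(0, max_start)  # inclusive on both ends
        background.append((chrom, new_start, new_start + length))
    return background


# Hypothetical toy inputs (not real genome coordinates).
toy_positives = [("chr1", 1000, 2500), ("chr1", 9000, 9800), ("chr2", 400, 1400)]
toy_chrom_sizes = {"chr1": 50_000, "chr2": 30_000}
toy_background = sample_matched_background(toy_positives, toy_chrom_sizes)
```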

The second step of EnhancerFinder aims to distinguish 
enhancers active in a given embryonic tissue from non-enhancers 
and enhancers active in other tissues. We consider all enhancers in 
VISTA with activity in a tissue of interest as positives and all other 
regions in VISTA (including regions not active at E11.5) as 
negatives (see Methods). Including enhancers active in other 
tissues as negatives during training proves to be essential for 
obtaining high specificity in predicting tissue of activity (see 
below). It is therefore important to use two steps rather than 
attempting to distinguish enhancers of a given tissue from the 
genomic background in one step. 
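
The Step 2 labeling scheme can be written down in a few lines. The region identifiers and tissue annotations below are hypothetical placeholders, not VISTA records; the sketch only shows that regions active in other tissues and regions with no activity both become negatives.

```python
def tissue_labels(region_tissues, tissue):
    """Label each tested region +1 if it is active in `tissue`, else -1.
    Regions active only in other tissues, or in none, are negatives."""
    return {
        region: (1 if tissue in tissues else -1)
        for region, tissues in region_tissues.items()
    }


# Hypothetical annotations: region id -> set of tissues with enhancer activity.
toy_regions = {
    "region_a": {"heart"},
    "region_b": {"limb", "brain"},
    "region_c": set(),  # tested, but no enhancer activity observed
    "region_d": {"heart", "limb"},
}
heart_labels = tissue_labels(toy_regions, "heart")
```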

To evaluate EnhancerFinder, we compared it to several 
commonly used enhancer prediction approaches. Unless otherwise 
noted, we evaluated the performance of all prediction algorithms 
using 10-fold cross validation to compute the area under the curve 
(AUC) for receiver operating characteristic (ROC) curves. We also 
computed precision-recall curves (Figure S1) and compared power 
at a low false positive rate. 
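
The evaluation quantities named above can be sketched as follows. This is a generic illustration, not the evaluation code used in the study: the helper names are hypothetical, the fold assignment is a simple interleaving chosen for brevity, and "power" is taken to mean the true positive rate at a score threshold whose false positive rate does not exceed the stated limit.

```python
def k_fold_indices(n, k=10):
    """Split indices 0..n-1 into k interleaved folds (assumed scheme)."""
    return [list(range(i, n, k)) for i in range(k)]


def roc_auc(scores, labels):
    """ROC AUC as the probability that a random positive outscores a random
    negative, counting ties as one half."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


def power_at_fpr(scores, labels, max_fpr):
    """Largest true positive rate reachable by a threshold with FPR <= max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y != 1]
    best = 0.0
    for threshold in sorted(set(scores), reverse=True):
        fpr = sum(n >= threshold for n in neg) / len(neg)
        if fpr > max_fpr:
            break  # FPR only grows as the threshold is lowered
        best = sum(p >= threshold for p in pos) / len(pos)
    return best


# Hypothetical held-out scores and labels (1 = enhancer, 0 = background).
toy_scores = [0.9, 0.8, 0.7, 0.6, 0.4, 0.2]
toy_labels = [1, 1, 0, 1, 0, 0]
toy_auc = roc_auc(toy_scores, toy_labels)
toy_power = power_at_fpr(toy_scores, toy_labels, max_fpr=0.0)
```

In a cross-validation run, each fold would be held out in turn, the classifier trained on the remaining folds, and the held-out scores pooled before computing these statistics.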

Building a general predictor from a biased training set 

Because EnhancerFinder learns enhancer signatures from a 
training data set, we first explored biases in the VISTA enhancers 
that might affect how well EnhancerFinder could generalize to the 
whole genome. The genomic regions tested by VISTA were not 
selected randomly, and thus their positives do not represent a 
random sample of active enhancers. Nearly all regions tested by 
VISTA are evolutionarily conserved across mammals (706 of 711 
positives and 727 of 736 negatives). Since our goal is to predict a 
broadly applicable, high confidence set of developmental enhanc- 
ers, we did not include this feature when making genome wide 
predictions. However, with this bias in mind, we did evaluate 
several models that incorporate the degree of evolutionary 
conservation (see below). 

In addition to conservation, several studies deposited in VISTA 
have considered enhancer-associated proteins and histone marks, 
such as p300, H3K27ac, and H3K4me1. We collected all data sets 
of these types from ENCODE and computed their overlap with 
VISTA enhancers. Fewer than half of the VISTA positives are 
marked by all three of p300, H3K27ac, and H3K4me1 (from any 
data set), with substantial percentages marked by only one or two 
and 13% (93/711) marked by none (Figure S2). These findings 
indicate that VISTA positives are not highly biased towards a 
single type of ChIP-seq feature, motivating us to include these 
features in our genome-wide predictions, with the caveat that the 
trends we observe for VISTA positives might not generalize to all 
classes of enhancers. Our analysis also suggests that the standard 
practice of equating active enhancers with all regions marked by a 
single ChIP-seq feature, or even the union of overlapping peaks 
from several ChIP-seq experiments, will fail to identify all active 
enhancers in a given context. 

EnhancerFinder integrates diverse data types to 
accurately identify developmental enhancers 

EnhancerFinder predicts enhancers by integrating classifiers 
based on distinct data types. In our first evaluation of 
EnhancerFinder, we consider functional genomics data, 
evolutionary conservation patterns, and DNA sequence motifs. 
Combining these different approaches enables EnhancerFinder to 
accurately distinguish enhancers from the genomic background 
(Figure 2A; AUC = 0.96). 

Figure 1. Overview of the EnhancerFinder enhancer prediction pipeline. In our two-step approach, regions of the genome are characterized 
by diverse features, such as their evolutionary conservation, regulatory protein binding, chromatin modifications, and DNA sequence patterns. For 
each step, appropriate positive (green) and negative (purple) training examples are provided as input to a multiple kernel learning (MKL) algorithm 
that produces a trained classifier. We used 10-fold cross validation to evaluate the performance of all classifiers. In Step 1, we trained a classifier to 
distinguish between known developmental enhancers from VISTA and the genomic background. In Step 2, we trained several classifiers to 
distinguish enhancers active in tissues of interest from those without activity in the tissue according to VISTA. We applied the trained enhancer 
classifier from Step 1 to the entire human genome to produce more than 80,000 developmental enhancer predictions. We then applied the 
tissue-specific enhancer classifiers from Step 2 to further refine our predictions. 
doi:10.1371/journal.pcbi.1003677.g001 

The functional genomics component of EnhancerFinder (which 
we refer to as All Functional Genomics) is a linear SVM that 
incorporates 2469 datasets generated by the ENCODE project 
and smaller scale studies. These include DNaseI hypersensitivity 
data and ChIP-seq for p300, many histone modifications, and 
many TFs from many adult and embryonic tissues and cell lines 
(Table S1). DNA sequence patterns are integrated via a 
4-spectrum kernel (DNA Motifs), which summarizes the 
occurrence of all length four DNA sequences (4-mers) in input regions 
[71]. We found that little was gained by increasing k, considering 
multiple k simultaneously, or incorporating knowledge of tran- 
scription factor binding site (TFBS) motifs as in a previous 
approach [52] (Figures S3 and S4). Finally, evolutionary 
conservation information is incorporated with a linear SVM that 
uses mammalian phastCons scores [72] as features (Evolutionary 
Conservation). 
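
A k-spectrum kernel compares two sequences by the inner product of their k-mer count vectors. The sketch below illustrates this for k = 4; it is a minimal version under stated assumptions (overlapping k-mers, case-insensitive, no special handling of ambiguous bases or reverse complements), and the function names are hypothetical.

```python
from collections import Counter


def kmer_counts(seq, k=4):
    """Count all overlapping k-mers in a DNA sequence."""
    seq = seq.upper()
    return Counter(seq[i:i + k] for i in range(len(seq) - k + 1))


def spectrum_kernel(seq_a, seq_b, k=4):
    """Inner product of the k-mer count vectors of two sequences."""
    counts_a = kmer_counts(seq_a, k)
    counts_b = kmer_counts(seq_b, k)
    # A Counter returns 0 for k-mers it has not seen, so absent k-mers add nothing.
    return sum(n * counts_b[kmer] for kmer, n in counts_a.items())


# Toy sequences: "ACGTACGT" contains ACGT twice plus CGTA, GTAC, and TACG once.
shared = spectrum_kernel("ACGTACGT", "ACGT")
self_similarity = spectrum_kernel("ACGTACGT", "ACGTACGT")
```

Because the kernel depends only on k-mer counts, regions of different lengths can be compared directly, although in practice the counts are often normalized by sequence length.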

EnhancerFinder performs significantly better than 
enhancer prediction approaches based on a single type 
of data 

One motivation for developing EnhancerFinder was to explore 
whether combining previous successful approaches to enhancer 
prediction would improve performance. Each of the classifiers 
combined in EnhancerFinder is representative of a different 
strategy for predicting enhancers. Thus, we compared the 
performance of EnhancerFinder to each of its constituents, which 
are SVMs trained on the same enhancer data as EnhancerFinder, 
but using only one type of the data features (e.g., only sequence 
motifs). EnhancerFinder significantly outperformed each of the 
individual classifiers (Figure 2A; p = 2.0E-7 for Evolutionary 
Conservation, p = 2.6E-8 for DNA Motifs, and p = 4.4E-16 for 
All Functional Genomics, McNemar's test), suggesting that 
these different types of data capture unique aspects of enhancers 
that are not completely encompassed by any single data type. 
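
McNemar's test compares two classifiers on the same examples using only the cases where exactly one of them is correct. The sketch below uses the exact binomial form of the test; whether the study used this form or the chi-square approximation is not stated here, so this choice and the toy correctness vectors are assumptions.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value from per-example correctness flags."""
    only_a = sum(1 for a, b in zip(correct_a, correct_b) if a and not b)
    only_b = sum(1 for a, b in zip(correct_a, correct_b) if b and not a)
    n = only_a + only_b
    if n == 0:
        return 1.0  # the classifiers never disagree
    k = min(only_a, only_b)
    # Under the null, each discordant example favors either classifier with
    # probability 1/2, so the smaller count follows Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical example: classifier A alone is right on 8 regions, B alone on 2.
toy_correct_a = [True] * 8 + [False] * 2
toy_correct_b = [False] * 8 + [True] * 2
toy_p = mcnemar_exact(toy_correct_a, toy_correct_b)
```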

Not surprisingly, we found that of the three component 
classifiers in EnhancerFinder, Evolutionary Conservation 
yields the best performance (AUC = 0.93). As noted above, nearly 
all regions tested for enhancer activity by VISTA (positives and 
negatives) are evolutionarily conserved compared to the genomic 
background. Nonetheless, considering additional features 
significantly improved predictions. The DNA Motifs (AUC = 0.88) and 
All Functional Genomics (AUC = 0.89) classifiers also exhibit 
strong performance, but neither performs as well as the 
combined classifier. EnhancerFinder has nearly twice the power of 
any of the individual classifiers at a 5% false positive rate (FPR), 
and its power advantage is even larger at lower FPRs. 

All Functional Genomics, DNA Motifs, and Evolution- 
ary Conservation achieve roughly similar performance from 
different feature data, but each individual classifier predicts a 
somewhat different set of enhancers during evaluation (Figure 2B). 
Roughly two-thirds of the enhancer predictions are shared 
between the three classifiers. The improvement provided by 
combining these data argues that these data sources are indeed 
complementary. 

We also compared EnhancerFinder's performance with several 
current computational methods used to identify enhancers. We 
were able to make the most direct comparison with CLARE, a 
popular method for identifying enhancers from DNA sequence 
data, i.e., transcription factor binding site motifs and other 
sequence patterns [73]. This approach, which has been success- 
fully applied in several contexts [51,52,53,74], makes few 
assumptions about the input, and is publicly available as a web 
server. On our Step 1 enhancer prediction task, we find that 
CLARE achieves an ROC AUC of 0.79. This is much lower than 
DNA Motifs (AUC = 0.88), our approach based on sequence 
data alone, and the full EnhancerFinder (AUC = 0.96; 
Figure 2C). At a 5% FPR, the power of CLARE is about 20%, 
compared to approximately 30% for DNA Motifs and more than 
60% for EnhancerFinder. 

Comparisons with additional methods were complicated by the 
fact that most were developed in different contexts. We designed 
EnhancerFinder specifically to predict biologically active develop- 
mental enhancers. Most existing approaches focus on data from a 
single cell line and define enhancers based on specific enhancer- 
associated marks or proteins (such as p300 in human embryonic 
stem cells) rather than biological activity. Thus, we did not 
anticipate that they would perform as well as EnhancerFinder at 
developmental enhancer prediction. However, since the predic- 
tions of these methods are commonly used outside the specific 
contexts in which they were made, we believe that it is useful to 
evaluate how well they can identify developmental enhancers and 
how much the EnhancerFinder approach applied to developmen- 
tal enhancers improves on their performance. 

In particular, we compared EnhancerFinder to ChromHMM 
and Segway [64,65], two unsupervised machine learning methods 
for segmenting the genome into a small number of functional 
"states" based on consistent patterns in ENCODE data for 
individual cell lines. The states resulting from the segmentations of 
each cell line's data are annotated by hand into predicted 
functional classes, which include enhancer activity. To evaluate 
these methods, we considered the states overlapping our training 
and testing regions. Any region with an overlapping enhancer state 
was considered a predicted enhancer and all others were predicted 
non-enhancers. In this way, we obtained a single point in ROC 
space for the state predictions. Since there is no score or 
confidence value associated with the state assignments, a full 
ROC curve could not be created for these methods. Figure 2C 
gives the performance for several versions of ChromHMM and 
Segway based on ENCODE data from different cell lines. Both 
methods perform better than random, but considerably worse than 
EnhancerFinder and CLARE (p ≈ 0). We stress that, in contrast to 
our supervised method, these methods were not explicitly trained 
to perform the same task as EnhancerFinder, and thus we did not 
expect them to perform as well as EnhancerFinder. Indeed, these 
results argue that their utility in identifying developmental 
enhancers is limited compared to specialized approaches. 
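
The way a set of state annotations yields a single point in ROC space can be sketched as follows. The interval representation (half-open coordinates), the function names, and the toy regions are assumptions made for illustration; any overlap with an enhancer state counts as a predicted enhancer, as described above.

```python
def overlaps_any(region, intervals):
    """True if `region` shares at least one base with any interval.
    Regions and intervals are (chrom, start, end), half-open."""
    chrom, start, end = region
    return any(c == chrom and s < end and start < e for c, s, e in intervals)


def roc_point(positives, negatives, enhancer_states):
    """Return (false positive rate, true positive rate) for a binary call
    made by overlap with the annotated enhancer states."""
    tpr = sum(overlaps_any(r, enhancer_states) for r in positives) / len(positives)
    fpr = sum(overlaps_any(r, enhancer_states) for r in negatives) / len(negatives)
    return fpr, tpr


# Hypothetical toy regions and enhancer-state segments.
toy_pos = [("chr1", 100, 200), ("chr1", 500, 600)]
toy_neg = [("chr1", 1000, 1100), ("chr2", 100, 200)]
toy_states = [("chr1", 150, 160), ("chr1", 590, 700), ("chr2", 200, 300)]
toy_fpr, toy_tpr = roc_point(toy_pos, toy_neg, toy_states)
```

Because the call is binary, there is no threshold to vary, which is why each segmentation contributes one point rather than a curve.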
Figure 2. Combining diverse data using EnhancerFinder 
improves the identification of developmental enhancers. (A) 
Enhancer prediction strategies based on functional genomics data, 
evolutionary conservation, and DNA sequence motif patterns all 
perform well, but EnhancerFinder, which combines these data, provides 
significant improvement over each of them alone (p<2.0E-7 for all). (B) 
Each of the approaches from (A) predicts that somewhat different sets 
of the VISTA regions are enhancers. This suggests that complementary 
information is contained in each data source. EnhancerFinder (not 
shown), which combines them, captures many of the enhancers that 
are unique to each source; it predicts 25 of the 44 enhancers unique to 
Functional Genomics, 30 of the 76 unique to DNA Sequence 
Motifs, and 34 of the 111 unique to Evolutionary Conservation. (C) 
EnhancerFinder outperforms CLARE, a successful enhancer prediction 
method based on known regulatory motifs. We also evaluated the 
enhancer states predicted by ChromHMM and Segway, two unsuper- 
vised clustering methods that have been used to segment the genome 
into different functional states based on patterns in functional 
genomics data, though these methods were not applied to develop- 
mental contexts. The different X's represent state predictions based on 
data from different ENCODE cell types: GM12878 (blue), H1-hESC 
(violet), HepG2 (brown), HMEC (tan), HSMM (gray), HUVEC (light green), 
K562 (green), NHEK (orange), NHLF (light blue), and all contexts 
combined (red). 

doi:10.1371/journal.pcbi.1003677.g002 



[Figure 2 panels. (B) Venn diagram of predicted enhancers: DNA Sequence Motifs (714), Evolutionary Conservation (770), Functional Genomics (643). (C) ROC plot with False Positive Rate on the x-axis: EnhancerFinder (0.96), CLARE (0.79), ChromHMM Enhancer States, Segway Enhancer, Segway TF Activity.] 



Integrating diverse functional genomics data improves 
enhancer prediction 

As illustrated above, our machine learning prediction and 
evaluation framework enabled us to quantitatively explore the 
utility of different genomics datasets in enhancer prediction by 
creating classifiers based on different types of data (i.e., sequence 
motifs, evolutionary conservation, and functional genomics) and 
comparing their performance. We also used this framework to 
investigate other questions about the utility of different subsets of 
these data for enhancer prediction. For example, one might expect 
that some of the datasets included in All Functional Genomics 
(e.g., experiments in cancer cell lines or adult tissues) would not be 
as useful as others (e.g., experiments in embryonic tissues) for 
predicting developmental enhancers, and that limiting the features 
examined by the classifier to the most relevant experiments might 
improve performance. 

To explore this hypothesis, we trained linear SVM classifiers to 
predict VISTA enhancers (as in Step 1 of EnhancerFinder) based 
on different subsets of all the functional genomics features (Table 1) 
and compared their performance. First, we considered a collection 
of 244 datasets from embryonic tissues and cell lines (Embryonic 
Functional Genomics). Next, we created a classifier that 
considers data from a wider range of contexts by training a linear 
SVM using a large, manually curated set of 509 potentially 
relevant functional genomics data sets (Relevant Functional 
Genomics). This set includes embryonic datasets, along with 
additional DNaseI and ChIP-seq data from adult tissues and cell 
lines related to the dominant tissues of activity in VISTA. For 
example, we included data from human cardiac myocytes, since 
there are many developmental heart enhancers in our training 
examples. We compared these to the All Functional Genomics 
classifier described above that uses all 2496 functional genomics 
features. 

All Functional Genomics (AUC = 0.89) performed slightly, 
but not significantly, better than Relevant Functional Geno- 
mics (AUC = 0.87; p = 0.16), and both significantly outperformed 
Embryonic Functional Genomics (AUC = 0.83; p = 9.2E-9 
and p = 2.7E-6, respectively) (Figure 3A). At low FPRs, the 
differences in power between these classifiers were modest. The 
Embryonic Functional Genomics classifier included the most 
time-appropriate datasets, yet its performance was improved by 
including additional data sets that seem less relevant to our 
classification problem a priori. Thus, we conclude that it can be 
advantageous to consider a range of functional genomics features, 
especially when few features are available from the context of 
interest. The utility of these additional data sets might indicate that 
some enhancer features are stable across cell types and develop- 
mental stages, but it could also reflect information these data 
provide about genomic regions that are not active enhancers 
during development (see Discussion). 
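
The classifier comparisons above are all stated in terms of ROC AUC. As a reminder of what that number measures, the following minimal sketch computes AUC as the probability that a randomly chosen positive outscores a randomly chosen negative (the Mann-Whitney formulation). The function name and the toy scores are hypothetical and are not taken from the study.

```python
def roc_auc(pos_scores, neg_scores):
    """ROC AUC as the chance that a random positive outscores a random negative.

    Ties count as half a win (the Mann-Whitney U formulation).
    """
    if not pos_scores or not neg_scores:
        raise ValueError("need at least one positive and one negative score")
    wins = 0.0
    for p in pos_scores:
        for n in neg_scores:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos_scores) * len(neg_scores))


# Toy SVM decision values (hypothetical, for illustration only).
enhancer_scores = [1.2, 0.8, 0.3, -0.1]
background_scores = [0.5, -0.4, -0.9, -1.3]
auc = roc_auc(enhancer_scores, background_scores)
```

This quadratic-time version is meant for clarity; a rank-based implementation would be preferable for thousands of regions.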

Histone marks and p300 provide complementary 
information about enhancer activity 

We also explored the utility of individual functional genomics 
datasets that are often used as proxies for developmental 
enhancers by creating three linear SVM classifiers: H3K27ac, 
H3K4me1, and p300. These SVMs were trained to distinguish 
VISTA positives from the genomic background (Step 1) using all 
available data of the specified type from ENCODE, which include 
a range of cell types and tissues (Table S1). All three classifiers 
performed better than random (Figure 3B). H3K4me1 
(AUC = 0.72) and p300 (AUC = 0.68) performed similarly 
(p = 0.25), with p300 performing best at low FPRs and 
H3K4me1 best at higher FPRs. Both significantly outperformed 
H3K27ac (AUC = 0.61; p = 9.4E-15 and p = 5.5E-9, respectively); 
however, we caution against extrapolating from this comparison, 
since it may reflect biases in the feature sets available and the 
VISTA positives. Since combinations of these features are often 
used to predict enhancers, we next trained a linear SVM classifier 
(Basic Functional Genomics) that includes all three data types 
together. The combined classifier significantly outperforms all the 
individual classifiers (AUC = 0.77; p<2E-7 for each), suggesting 
that each data type contributes unique information about 
enhancer activity. Also, all four SVM classifiers achieved much 
better performance than the common approach of simply 
considering regions overlapping with these data (Figure S5). 

EnhancerFinder also learns weights for individual features 
within classifiers that reflect their contribution to the enhancer 
predictions. We found that features known to be associated with 
enhancer activity in relevant cellular contexts generally receive 
positive weights, while those associated with other types of 
elements receive negative weights (Text S1 and Figure S6). 

EnhancerFinder's two-step approach enables tissue- 
specific enhancer prediction 

In the previous sections, we focused on generic developmental 
enhancer prediction (Step 1 of EnhancerFinder). Step 2 of 
EnhancerFinder applies a second round of MKL to refine and 
further annotate predicted enhancers from Step 1 (Figure 1). In 
this study, Step 2 consists of training an MKL classifier to 
distinguish VISTA enhancers active in a given tissue from VISTA 



regions without activity in that tissue, i.e., non-enhancers from 
VISTA plus enhancers for other tissues. We did not require that 
the positive training examples be active only in the tissue of interest. 
Using the same feature data as in Step 1, we created tissue-specific 
classifiers for all tissues with more than 50 examples in VISTA: 
forebrain, midbrain, hindbrain, heart, limb, and neural tube. 
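
The Step 2 training sets described above can be sketched as follows. This is only an illustration of the labeling rule (positives are VISTA regions with activity in the tissue of interest, possibly among others; negatives are all remaining tested VISTA regions). The function name, region IDs, and annotations are hypothetical.

```python
def step2_labels(vista_regions, tissue):
    """Split tested VISTA regions into Step 2 positives and negatives for one tissue.

    vista_regions maps a region ID to the set of tissues with validated activity;
    an empty set means the region showed no enhancer activity.
    """
    positives, negatives = [], []
    for region_id, tissues in sorted(vista_regions.items()):
        # A positive may also be active elsewhere; it only has to include `tissue`.
        if tissue in tissues:
            positives.append(region_id)
        else:
            # Negatives mix non-enhancers and enhancers of other tissues.
            negatives.append(region_id)
    return positives, negatives


# Hypothetical region IDs and annotations, for illustration only.
toy_vista = {
    "r1": {"heart"},
    "r2": {"heart", "limb"},
    "r3": {"forebrain"},
    "r4": set(),
}
heart_pos, heart_neg = step2_labels(toy_vista, "heart")
```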

The performance of EnhancerFinder's tissue specificity predic- 
tions varied dramatically between tissues (Figure 4), with the best 
performance for heart (AUC = 0.85), followed by limb 
(AUC = 0.74), forebrain (AUC = 0.72), midbrain (AUC = 0.72), 
hindbrain (AUC = 0.69), and neural tube (AUC = 0.62), which was 
the worst of the tested tissue classifiers, but better than random. 
We combined all brain enhancers into one class, and the 
performance of this generic brain classifier was similar to that of 
the more specific brain classifiers (AUC = 0.73). The Enhancer- 
Finder tissue-specific classifiers trained with all data types 
performed well for most tissues (Table 1); however, classifiers 
based on functional genomics alone often performed as well as the 
full EnhancerFinder classifier, suggesting functional genomics data 
are more informative about developmental enhancer tissue 
specificity than degree of conservation or sequence motifs. 

Most previous efforts to predict tissue-specific enhancers have 
performed a single training step using enhancers or enhancer 
marks present in the tissue of interest as positives and non- 
enhancer regions or the genomic background as negatives. To test 
whether our two-step method improves upon these previous 
approaches, we trained one-step MKL tissue-specific classifiers 
and compared their predicted tissue distributions to those of 
validated enhancers from the VISTA database (Figure 5A). First, 
we trained a set of tissue-specific classifiers using enhancers active 
in each tissue as positives and the genomic background as 
negatives. These classifiers predict very similar sets of enhancers 
regardless of the target tissue; and they vastly overestimate the 
number of enhancers that are active in multiple tissues (95% of 
predictions versus 8% of VISTA) and the number of true 
enhancers of each tissue (Figure 5B). In contrast, classifiers trained 
as in Step 2 of EnhancerFinder, i.e., using tissue-specific enhancers 
as positives and a mix of enhancers active in other tissues and 
regions with no activity in VISTA as negatives, show much greater 
tissue-specificity in their predictions (76%) and a similar amount of 
overlap as among known enhancers (Figure 5C). 
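
The multi-tissue percentages compared above amount to counting how many regions appear in more than one tissue's prediction set. A minimal sketch of that bookkeeping is given below; the function name and the toy sets are hypothetical, and the exact counting conventions used for Figure 5 are not specified in this section.

```python
def multi_tissue_fraction(predictions_by_tissue):
    """Fraction of predicted regions that are called active in more than one tissue.

    predictions_by_tissue maps a tissue name to the set of region IDs
    predicted active in that tissue.
    """
    tissue_counts = {}
    for regions in predictions_by_tissue.values():
        for region_id in regions:
            tissue_counts[region_id] = tissue_counts.get(region_id, 0) + 1
    if not tissue_counts:
        raise ValueError("no predicted regions")
    shared = sum(1 for count in tissue_counts.values() if count > 1)
    return shared / len(tissue_counts)


# Hypothetical prediction sets, for illustration only.
toy_predictions = {
    "heart": {"a", "b", "c"},
    "brain": {"c", "d"},
    "limb": {"e"},
}
shared_fraction = multi_tissue_fraction(toy_predictions)
```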

Heart enhancers are easier to identify due to several 
unique attributes 

The relative ease of identifying heart enhancers is likely due to 
several unique characteristics. Known heart enhancers at E11.5 
are more evolutionarily conserved than genomic background, but 
significantly less conserved than enhancers in other tissues [39,41]. 
In addition, we observed that heart enhancers at this develop- 
mental stage are uniquely close to the nearest transcription start 
site (TSS) (Figure S7). These two patterns are consistent with a 



Table 1. Performance (ROC AUC) of classifiers on each tissue-specific enhancer prediction task (Step 2). 

Classifier                  Heart   Limb   Forebrain   Midbrain   Hindbrain   Neural Tube
Evolutionary Conservation   0.78    0.58   0.52        0.54       0.53        0.52
DNA Motifs                  0.83    0.64   0.66        0.63       0.62        0.60
Functional Genomics         0.86    0.74   0.72        0.72       0.69        0.62
EnhancerFinder              0.85    0.74   0.72        0.72       0.69        0.62

Figure 3. Integrating diverse functional genomics data improves enhancer prediction. (A) Considering functional genomics features from 
contexts and assays not directly associated with developmental enhancer activity (All Functional Genomics and Relevant Functional 
Genomics) improves the identification of developmental enhancers (p = 9.2E-9 and p = 2.7E-6, respectively, compared to Embryonic Functional 
Genomics only). (B) Combining available H3K4me1, p300, and H3K27ac data, which are commonly used in isolation to identify enhancers, in a linear 
SVM (Basic Functional Genomics) is better able to distinguish known developmental enhancers from the genomic background than considering 
each type of data alone (p<2E-7, for each). However, combining these marks still performs significantly worse than EnhancerFinder (Figure 2A; 
AUC = 0.96) and considering additional data as in (A). 
doi:10.1371/journal.pcbi.1003677.g003 
Figure 4. Enhancers of heart expression are easier to identify 
than enhancers active in other tissues at E11.5. (A) In Step 2 of 
our prediction pipeline, we trained EnhancerFinder using the same 
features as in Step 1 (Figure 1), but using VISTA enhancers active in a 
given tissue as positives and tested regions that did not show activity in 
the tissue as negatives. Heart enhancers were dramatically easier to 
distinguish from other enhancers than enhancers of expression in other 
tissues. The heart enhancers have significantly higher GC content than 
other enhancers and the genomic background. This and several other 
unique attributes may explain the ease of identifying them (Figures S7 
and S8). In general, functional genomics data are the most informative 
data type for predicting enhancer tissue specificity (Table 1). 
doi:10.1371/journal.pcbi.1003677.g004 



recent study of mouse enhancers from different developmental 
stages [75]. Finally, we observed that E11.5 heart enhancers have 
an unusually high GC content (49%) compared to enhancers of 
other tissues at E11.5 (~40%). A simple classifier based solely on 
the GC content of a region performs nearly as well as our full 
classifier for heart enhancers (Figure S8). In contrast, sequence- 
based classifiers do not perform well on the other tissues whose 
enhancer GC content is not significantly different from the 
genomic background (Table 1). The high GC content of heart 
enhancers is not due to overlap with CpG islands. Only about 4% 
of VISTA enhancers overlap with a CpG island, and this number 
is consistent across tissues. We also did not find enrichment for any 
known GC-rich transcription factor binding site motifs in VISTA 
heart enhancers. We do see, however, that repeat regions in heart 
enhancers are depleted for the very AT-rich repeats seen in other 
enhancers, and that most of the repeat regions in heart enhancers 
are 40-60% GC. Our results suggest the possible existence of 
unknown GC-rich motifs that may be important for gene 
regulation in the cardiac lineage. 
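
To make the GC-content comparison concrete, the sketch below computes the GC fraction of a sequence and uses it directly as a ranking score. This is an assumption-laden illustration: the construction of the GC-only classifier in Figure S8 is not described in this section, and the function names and toy sequences are hypothetical.

```python
def gc_content(sequence):
    """Fraction of unambiguous bases (A, C, G, T) in `sequence` that are G or C."""
    seq = sequence.upper()
    gc = seq.count("G") + seq.count("C")
    at = seq.count("A") + seq.count("T")
    if gc + at == 0:
        raise ValueError("sequence has no unambiguous bases")
    # Ambiguous bases such as N are ignored rather than counted as non-GC.
    return gc / (gc + at)


def gc_scores(sequences):
    """Score each region by GC fraction; higher scores rank as more heart-like."""
    return [gc_content(s) for s in sequences]


# Toy sequences (hypothetical, far shorter than real 1.5 kb regions).
toy_sequences = ["GCGCATGC", "ATATATGC", "NNGCNNAT"]
scores = gc_scores(toy_sequences)
```

Scores of this kind could be fed to an AUC calculation to compare a one-feature baseline against a full classifier.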

The heart classifier based on functional genomics data alone 
exhibits strong performance compared to other tissue-specific 
classifiers as well (Table 1). It is possible that this is due to the 
presence of feature data from contexts more relevant to develop- 
mental heart activity than to other tissues, rather than unique 
attributes of the heart enhancers themselves. Indeed, the highest 
weighted features in the heart functional genomics classifier come 
from heart tissues. However, the performance of the heart classifier 
based only on functional genomics data does not decrease 
substantially when we exclude data from the most relevant contexts: 
embryonic heart tissue, adult hearts, and stages of a directed 
differentiation of stem cells into cardiomyocytes (ROC 
AUC = 0.85). Thus, it is possible that feature data from less 
obviously relevant contexts are more informative about heart 
[Figure 5 panel titles: (A) VISTA Positive Overlap; (B) One Step Prediction Overlap] 

Figure 5. EnhancerFinder's two-step approach captures tissue- 
specific attributes of enhancers. (A) The true overlap of human 
enhancers of brain, heart, and limb in the VISTA database. The vast 
majority of characterized enhancers are unique to one of these tissues 
at this stage. For example, of the 84 validated heart enhancers, 71 are 
unique to heart, five are shared with brain, four with limb, and four with 
both. (B) The predicted overlap of VISTA enhancers based on 
predictions made with a single training step using MKL with only 
enhancers of that tissue considered positives and the genomic 
background as negatives. This approach overestimates the number of 
enhancers active in multiple tissues. Each classifier mainly learns general 
attributes of enhancers, rather than tissue-specific attributes. (C) The 
predicted overlap based on EnhancerFinder's two-step approach. These 
predictions are much more tissue-specific and exhibit overlaps between 
tissues similar to the true values (A). Predicted tissue distributions are 
similar when the methods are applied to other genomic regions, as 
illustrated in our genome-wide predictions, but only predictions on 
VISTA enhancers are shown here to enable comparisons to the 
distribution for validated enhancers (A). 
doi:10.1371/journal.pcbi.1003677.g005 

activity than for other tissues. We suspect that the ease of 
distinguishing heart enhancers may be due to the earlier develop- 
ment of the heart compared to other tissues (see Discussion). 

We predict more than 80,000 developmental enhancers 
across the human genome 

One of the main motivations for developing algorithms that can 
distinguish active enhancers is to apply them to unannotated 
genomic regions to aid the exploration and interpretation of the 
gene regulatory landscape of the human genome (Figure 1). To 
produce a genome-wide set of candidate developmental enhanc- 
ers, we divided the genome into 1.5 kb blocks overlapping one 
another by 500 bp and applied Step 1 of EnhancerFinder to each 
of these regions. EnhancerFinder produces a score for each region; 
positive scores indicate membership in the positive set (enhancers), 
and negative scores indicate membership in the negative set (non- 
enhancers). To focus on high confidence predictions in this 
genome-wide analysis, we used the cross-validation-based evaluation 
described above to find a 5% FPR score threshold, and only 
considered regions exceeding this threshold. After merging 
overlapping positive predictions, we identified 84,301 develop- 
mental enhancers across the human genome with median length 
of 1,500 bp and total genome coverage of 183,695,500 bp 
(5.86%). 
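
The tiling and merging procedure described above can be sketched as follows. This is a minimal illustration under stated assumptions: coordinates are half-open, a trailing partial window at the end of a chromosome is dropped, and only truly overlapping (not merely adjacent) intervals are merged. The function names and the toy chromosome length are hypothetical.

```python
def tile_windows(chrom_length, window=1500, overlap=500):
    """Yield half-open (start, end) windows of `window` bp that overlap by `overlap` bp."""
    step = window - overlap
    start = 0
    # Assumption: a final window that would run past the chromosome end is skipped.
    while start + window <= chrom_length:
        yield (start, start + window)
        start += step


def merge_overlapping(intervals):
    """Merge half-open intervals that share at least one base into maximal blocks."""
    merged = []
    for start, end in sorted(intervals):
        if merged and start < merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged


# Toy 5 kb "chromosome"; pretend windows 0, 1, and 3 scored above the threshold.
windows = list(tile_windows(5000))
positive_windows = [windows[0], windows[1], windows[3]]
merged = merge_overlapping(positive_windows)
```

With 1.5 kb windows and a 1 kb step, two consecutive positive windows merge into one 2.5 kb prediction, while an isolated positive window stays at 1.5 kb.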

The 5% FPR threshold we used corresponds to a 65% true 
positive rate (TPR). To calculate the false discovery rate (FDR), we 
must estimate the unknown fraction of 1.5 kb blocks of the human 
genome that harbor developmental enhancer regions. If this 
fraction were as high as 50%, a 5% FPR would correspond to a 
9% FDR. If instead we estimate that 10% of 1.5 kb windows 
contain a developmental enhancer, we see an FDR of 47% at a 
5% FPR. While this may seem high, our recent analysis of 
predicted enhancers with human-specific substitution rate acceleration 
found a lower failure rate at E11.5 (17%, 5/29) [74], and 
only three of ten tested predictions did not validate with confirmed 
or suggestive activity in our zebrafish assay (see below). This 
suggests that the FDR may be lower in experimental applications, 
especially when predicted enhancer regions are analyzed in the 
context of other relevant data. However, to accurately measure the 
true FDR would require experimental testing of a very large, 
random set of EnhancerFinder predictions, which is beyond the 
scope of this study. 
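
The dependence of the FDR on the assumed fraction of enhancer-containing windows follows from the standard relationship between FPR, TPR, and prevalence. The sketch below implements that textbook relationship; it is not necessarily the exact calculation used for the figures quoted above, and the operating point in the example is a toy value rather than the study's estimate.

```python
def false_discovery_rate(fpr, tpr, enhancer_fraction):
    """Expected FDR given FPR, TPR, and the fraction of windows that are true enhancers."""
    if not 0.0 < enhancer_fraction < 1.0:
        raise ValueError("enhancer_fraction must be strictly between 0 and 1")
    # Expected share of all windows that are called positive but are not enhancers.
    false_pos = fpr * (1.0 - enhancer_fraction)
    # Expected share of all windows that are called positive and are enhancers.
    true_pos = tpr * enhancer_fraction
    if false_pos + true_pos == 0.0:
        raise ValueError("no positive calls at this operating point")
    return false_pos / (false_pos + true_pos)


# Toy operating point (hypothetical): FPR = 0.1, TPR = 0.9.
fdr_by_fraction = {
    fraction: false_discovery_rate(0.1, 0.9, fraction)
    for fraction in (0.5, 0.1)
}
```

At a fixed FPR and TPR, the FDR rises sharply as true enhancers become rarer among the scanned windows, which is the qualitative point made in the text.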

In our genome-wide analysis, we used the smaller Relevant 
Functional Genomics data set in order to reduce the 
computational time required. We also did not include evolutionary 
Figure 6. Predicted tissue-specific enhancers exhibit tissue-specific characteristics. EnhancerFinder identifies thousands of novel high- 
confidence (FPR<0.05) heart, brain, and limb enhancers. These enhancers are enriched for tissue-specific GO Biological Processes. The five most 
enriched GO Biological Processes among genes near each enhancer set (as calculated using GREAT) are listed in the colored boxes. Nearly 90% of 
EnhancerFinder predicted heart, brain, and limb enhancers are unique to a single tissue. The larger number of high-confidence heart enhancers 
relative to brain and limb enhancers is the result of the superior performance of the heart classifier. 
doi:10.1371/journal.pcbi.1003677.g006 



conservation data, because the positives in our training data are 
almost universally conserved. While most enhancers likely exhibit 
some evolutionary conservation, this extremely high fraction is 
likely due to bias in the selection of the tested regions in VISTA 
and could reduce our ability to detect less highly conserved novel 
enhancers genome-wide (see Discussion). The resulting conser- 
vation-free classifier still performed extremely well in cross 
validation (AUC = 0.92). Supporting this approach, non-con- 
served regions make up over 20% of our genome-wide enhancer 
predictions. As noted above, we did not observe any other 
dramatic biases in the feature data associated with human VISTA 
enhancers. 

Next, we applied Step 2 of EnhancerFinder to all enhancer 
regions predicted in Step 1. We focused on brain, limb, and 
heart, because these tissues are highly represented in VISTA and 
have been extensively studied in previous analyses of develop- 
mental enhancers. We predicted 7,400 limb enhancers, 19,051 
heart enhancers, and 11,693 brain enhancers (Figure 6) at a 5% 
FPR threshold tuned separately for each tissue. Since EnhancerFinder 
makes predictions for each tissue independently, there are 
no constraints on the distribution of tissues in the resulting 
genome-wide predictions. Nonetheless, we find a high level of 
tissue-specificity; nearly 90% of the limb, heart, and brain 
enhancers are predicted to be active in just one of the three 
tissues. 
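
Tuning a score threshold to a target FPR, separately for each tissue, can be sketched as below. The assumption here is that held-out scores for known negatives are available from cross validation and that a region is called positive only when its score strictly exceeds the threshold. The function name and the toy score lists are hypothetical.

```python
import math


def threshold_at_fpr(negative_scores, target_fpr=0.05):
    """Return a score threshold that lets through at most `target_fpr` of the negatives."""
    if not negative_scores:
        raise ValueError("need at least one negative score")
    ranked = sorted(negative_scores, reverse=True)
    allowed = math.floor(target_fpr * len(ranked))
    # Calling a region positive only when its score is strictly greater than
    # ranked[allowed] lets at most `allowed` negatives through.
    return ranked[min(allowed, len(ranked) - 1)]


# Hypothetical cross-validation scores of negatives for two tissues.
cv_negative_scores = {
    "heart": [i / 10 for i in range(-10, 10)],
    "limb": [i / 10 for i in range(-20, 20)],
}
thresholds = {
    tissue: threshold_at_fpr(scores)
    for tissue, scores in cv_negative_scores.items()
}
```

Because each tissue's classifier has its own score distribution, the same target FPR generally yields a different threshold per tissue.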

All genome-wide enhancer predictions are available as tracks 
for import into the UCSC Genome Browser (Data File S1). 



These lists of high-confidence tissue-specific enhancers should 
not be viewed as exhaustive; we found thousands of regions 
with positive, but less significant scores from Step 2 of 
EnhancerFinder. 

Predicted enhancers are associated with relevant 
functional genomic regions 

To characterize and further validate our genome-wide enhancer 
predictions, we examined their genomic distribution with respect 
to several independent indicators of function (details in Text S1). 
Genes near brain and heart enhancers are enriched for expression 
in relevant tissues (Tables S2 and S3). Similarly, Gene Ontology 
(GO) Biological Process enrichment analyses of nearby genes 
suggest that our predicted developmental enhancers target genes 
that function in relevant cell types and tissues (Figure 6). The most 
prevalent transcription factor binding site motifs found in the 
sequences of predicted enhancers differed between enhancers of 
different tissues and included many relevant developmental TFs 
(Table S4). Finally, our predicted enhancers contain 676 lead 
SNPs associated with significant effects in GWAS (Table S5); this 
is significantly more than expected at random (permutation p< 
0.001). 
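
A permutation p-value of the kind reported above is typically an empirical tail fraction over counts obtained from randomized region sets. The sketch below shows one common estimator with a +1 correction. The details of the permutation scheme are in Text S1 and are not reproduced here, so the function name, the estimator, and the toy null counts are all assumptions.

```python
def permutation_p_value(observed, null_counts):
    """One-sided empirical p-value for `observed` against permutation-derived counts.

    The +1 in numerator and denominator counts the observed value itself as one
    permutation and keeps the estimate from ever being exactly zero.
    """
    if not null_counts:
        raise ValueError("need at least one permutation")
    at_least_as_extreme = sum(1 for count in null_counts if count >= observed)
    return (at_least_as_extreme + 1) / (len(null_counts) + 1)


# Toy example (hypothetical): SNP overlap counts from nine shuffled region sets.
toy_null = [3, 5, 4, 6, 2, 5, 4, 3, 7]
p_value = permutation_p_value(6, toy_null)
```

Under this estimator, the smallest attainable p-value with N permutations is 1/(N + 1), so the number of permutations bounds how small a reported p-value can be.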

Taken together, these analyses suggest that EnhancerFinder 
identifies many active regulatory regions that contain functionally 
relevant variation. Our tissue-specific enhancer predictions give 
valuable annotations to thousands of non-coding regions of the 
human genome that had not previously been linked to develop- 
Figure 7. Four novel developmental enhancers near FOXC2. This UCSC Genome Browser (http://genome.ucsc.edu) snapshot shows the 
genomic context of four candidate human enhancers tested in transgenic zebrafish. For each enhancer, we show a zebrafish image that is 
representative of the reproducible expression patterns. FOXC2 Enhancer Candidate 1 (F2EC-1) drives expression at 48 hpf in the eye and epidermis 
(arrows). F2EC-2 shows expression at 24 hpf in the forebrain, midbrain, and nerve. F2EC-3 drives expression at 48 hpf in the epidermis and heart. 
F2EC-4 shows expression at 48 hpf in the notochord, spinal cord, and heart. See Table S6 for full list of expressed tissues seen in each candidate 
enhancer and Figure S10 for results on candidate enhancers near FOXC1. 
doi:10.1371/journal.pcbi.1003677.g007 



mental regulation. For example, thousands of SNPs associated 
with disease by GWAS are in non-coding regions with limited 
functional annotations [76]. Our genome-wide enhancer predic- 
tions provide a resource for exploring the mechanisms and 
functional effects of these uncharacterized GWAS hits. 

EnhancerFinder predictions function as enhancers in the 
developing embryo 

To demonstrate that genome-wide EnhancerFinder predictions 
can facilitate the discovery of functional regulatory elements, we 
present two case studies in which we identify and validate novel 
enhancers near genes active during development. 

EnhancerFinder identifies many novel enhancers near 
FOXC1 and FOXC2. To evaluate several EnhancerFinder 
predictions, we took advantage of a transgenic enhancer assay 
in embryonic zebrafish (Methods). We tested enhancer activity of 
ten predicted human enhancers near FOXC1 and FOXC2, two 
forkhead box TFs. The mouse homologs Foxc1 and Foxc2 have 
been studied extensively and have been shown to be required for 
proper embryonic development; Foxc1 null and Foxc2 null 
mutants are pre- or perinatal lethal [77,78,79]. In humans, 
complete lack of FOXC1 is also typically pre- or perinatal lethal, 
and deletions near and point mutations in FOXC1 contribute to 
eye and brain development disorders [80,81]. Figure 7 shows the 
genomic context of FOXC2, along with the candidate enhancers 
that we tested (FOXC2 Enhancer Candidates, or F2ECs). FOXC1 
results are shown in Supplementary Figure S10 (FOXC1 
Enhancer Candidates, or F1ECs). Six of the ten predicted 
human enhancer sequences showed consistent enhancer activity 
in zebrafish at 24 or 48 hours post fertilization (hpf) (F1EC-1, 
F1EC-6, F2EC-1, F2EC-2, F2EC-3, and F2EC-4). One addi- 
tional candidate enhancer (F1EC-3) showed suggestive enhancer 
activity. EnhancerFinder predicted tissue specificity for eight of 
the ten candidate enhancers, and we saw the predicted 



expression pattern confirmed for just one candidate enhancer 
(F2EC-3, predicted heart enhancer), and suggestive expression 
for another (F1EC-6, predicted heart enhancer). However, it is 
difficult to interpret this result, since the tested stages (24 and 
48 hpf) do not directly correspond to single stages of mammalian 
development, and some of the studied tissues are not homolo- 
gous. Also, since we tested predicted human enhancer sequences 
in zebrafish, it is possible that differences in developmental 
regulation between human and fish contributed to this result. 

EnhancerFinder predictions highlight a novel enhancer 
near ZEB2. Next, we sought to investigate a novel enhancer 
prediction in a mammalian system. We selected the locus 
containing ZEB2, a zinc finger E-box-binding homeobox-2 TF, 
which has many roles throughout embryonic and postnatal 
development, in particular in cortical neurogenesis 
[82,83,84,85]. Mutations in ZEB2 are associated with 
Mowat-Wilson syndrome, a complex developmental disorder [86]. 
However, relatively little is known about the genetic mechanisms 
that orchestrate ZEB2 expression. A long-range 
enhancer of postnatal expression in developing kidney cells 
(E1 in Figure 8) was recently discovered 1.2 megabases (Mb) 
downstream of ZEB2 in the adjacent gene desert [87]. Since 
this enhancer does not fully recapitulate the expression timing 
and domains of ZEB2, the authors speculated that the gene has 
many other, potentially long-range, enhancers. Supporting this 
theory, there are two validated E11.5 brain enhancers near 
ZEB2 in the VISTA Enhancer Browser (Figure 8, VISTA 
hs407 and VISTA hs1802). Finally, there is an enrichment of 
human accelerated regions (HARs) [88,89] near ZEB2, 
suggesting that it may have human-specific regulatory patterns. 

Our EnhancerFinder predictions support the existence of a rich 
regulatory program specified in the non-coding sequence nearby 
ZEB2; there are 54 predicted enhancers for which it is the nearest 
TSS. This puts ZEB2 in the top 0.2% of all genes with respect to 
Figure 8. A novel cranial nerve enhancer in the ZEB2 locus. This UCSC Genome Browser snapshot shows a dense region of predicted 
enhancers in a 1.5 Mb window on human chromosome 2 including ZEB2 and part of the adjacent gene desert. Tracks give the locations of four 
human accelerated regions (HARs), two validated VISTA enhancers (hs407 and hs1802), and the E1 region recently shown to have postnatal enhancer 
activity [87]. The inset shows a zoomed in view of ZEB2 (hg19.chr2:145,100,000-145,425,000) along with summaries of several ENCODE functional 
genomics datasets and evolutionary conservation across placental mammals. We tested the predicted enhancer overlapping 2xHAR.240 for enhancer 
activity at E11.5 in transgenic mice. Both the human and chimp versions of this sequence drive consistent expression in the cranial nerve (Figure S11). 
doi:10.1371/journal.pcbi.1003677.g008 



the number of adjacent enhancer predictions. Supporting the 
validity of our predictions, the known VISTA enhancers both 
overlap EnhancerFinder predicted enhancers, while the regions 
known to be inactive or active at later postnatal developmental 
stages (E1) [87] do not. 

We selected an EnhancerFinder predicted enhancer (indicated 
in the zoomed pane of Figure 8) for further experimental analysis 
due to its high EnhancerFinder score and overlap with a HAR 
(2xHAR.240). We interrogated the potential of the human and 
chimp sequences at this region to drive gene expression at E11.5 in 
transient transgenic mouse embryos. All seven embryos with 
staining showed cranial nerve expression (Figure 8 red box; Figure 
S11), regardless of whether the construct contained the human or 
chimp sequence. Thus, we have identified a novel enhancer within 
the ZEB2 locus that overlaps one of its expression domains; 
however, whether this enhancer targets ZEB2 remains to be 
proven. 

This is not the only HAR enhancer validated to date. In a 
recent publication, we showed that many HARs function as 
developmental enhancers [90]. In that study, we experimentally 
tested 29 HARs that EnhancerFinder predicts to function as 
developmental enhancers, and found, in agreement with the cross- 
validation and zebrafish experimental validation rates here, that 
24 of the regions (83%) show positive enhancer activity at E11.5. 



In addition, one EnhancerFinder negative showed no enhancer 
activity. 

While none of the enhancer predictions tested so far were 
randomly selected, our results suggest that EnhancerFinder is a 
powerful tool for accurately characterizing developmental regula- 
tory potential in many useful contexts. Our enhancer predictions 
highlight many additional candidates for further investigation, and 
we believe that they will enable similar analyses of the regulatory 
potential of many other genes and regions of interest. 

Discussion 

In this study, we developed EnhancerFinder, a new machine- 
learning framework for predicting regulatory enhancers from 
diverse data sources. In contrast to most previous enhancer 
identification strategies, which have based their predictions on one 
or a small number of data types, EnhancerFinder enables us to 
flexibly integrate the large and continually expanding collection of 
evolutionary, DNA sequence, and functional genomics data that 
are informative about enhancer function. Our analysis of the 
EnhancerFinder algorithm and its predictions makes three major 
contributions. First, we demonstrate that integrating diverse types 
of data from many cellular contexts, including some unexpected 
ones, can accurately predict in vivo validated developmental 
enhancers. Second, we show that a two-step approach in which 
enhancer tissue-specificity is individually evaluated after general 
enhancer prediction improves the identification of enhancers' 
tissues of activity. Finally, our genome-wide developmental 
enhancer annotations, including tissue-specific predictions for 
heart, brain, and limb, assign novel functions in development to 
thousands of genomic regions. We show that these predictions are 
enriched for a number of independent indicators of regulatory 
functions. As a result, we expect our predictions to prove useful in 
the annotation of non-coding genomic regions, as illustrated in the 
identification of novel enhancers near ZEB2, FOXC1, and FOXC2. 
Our genome-wide predictions are freely available as a UCSC 
Genome Browser track. 

A biologically active in vivo definition of "enhancer" 

We chose to define developmental enhancers for training as 
genomic regions that are experimentally shown to activate gene 
expression in vivo in embryonic mouse assays. We believe that this 
definition is better suited to identifying regions for further 
exploration and experimental characterization than approaches 
based on single data sources, such as p300, H3K4me1, or 
H3K27ac, associated with enhancers in individual cell lines. We 
showed that our predicted enhancers, based on this biologically 
active definition, significantly overlap data sets commonly used as 
proxies for enhancer activity, such as H3K27ac and p300 binding. 
However, these other data alone are not sufficient to identify all 
enhancers, as we demonstrated for H3K27ac, H3K4me1, and 
p300 in Figure 3B. Similarly, when we evaluated the ability of 
other computational methods to identify enhancers, we found that 
they performed better than random, but that EnhancerFinder 
significantly outperformed them at identifying biologically active 
developmental enhancers. This is not surprising given the different 
contexts in which some enhancer predictions, such as those from 
ChromHMM and Segway, were developed. 

While EnhancerFinder could be used to predict enhancers in 
well-characterized cell lines, it is particularly useful at identifying 
enhancers in complex tissues that contain multiple cell types and in 
cell types that do not have much specific functional genomics data 
available. Other computational approaches to enhancer prediction 
have focused on identifying enhancers in individual cell types using 
functional genomics data from the same cells [56] or using the 
differences in cell type specific transcription factor binding to 
identify cell-type specific binding motifs [61]. These methods 
generally perform well, but they do not address enhancer 
prediction in cell types with little or no functional genomics data, 
or in tissues that contain multiple cell types. 

Why do seemingly irrelevant data improve our enhancer 
predictions? 

Data such as p300 binding sites and H3K4me1 have been used 
in previous studies to identify enhancers, and these data are major 
contributors to our enhancer predictions. However, data from 
other sources and contexts less directly associated with enhancer 
activity provide complementary information that improves our 
predictions. Some of these data may be negatively correlated with 
enhancer activity, allowing EnhancerFinder to learn what features 
distinguish regions that are not developmental enhancers. Others 
may help reinforce patterns present in data from more relevant 
contexts, reflecting some degree of stability in the features of 
enhancer regions across developmental stages and cell types. For 
example, we found that features measured in embryonic stem cells 
are quite useful for E11.5 enhancer prediction; their removal from 
the classifier degrades performance and/or they have large 
(positive or negative) MKL weights. Examination of these features 



suggests that some identify "poised" regions that will become 
active enhancers upon differentiation, while others seem to help 
distinguish stem cell enhancers (i.e., non-enhancers at E11.5) from 
those specific to differentiated lineages. We note that despite these 
interesting observations, most individual functional genomics 
features do not carry a great deal of information and the power 
of EnhancerFinder comes from the integration of different types of 
data. It is also possible that as a more complete experimental 
characterization of chromatin state and protein-DNA binding 
from E11.5 tissues is obtained, data from less relevant contexts will 
not provide as much improvement as they did in this study. 

What data are most informative about enhancer activity? 

We focused on a single developmental stage with a large 
number of validated enhancers. To efficiently extend enhancer 
detection and validation to new contexts, it will be very important 
to select the most informative data to collect. Even though the 
ENCODE project has produced an impressive amount of data, it 
still has not extensively assayed most contexts of interest to 
researchers, in particular developmental biologists. The perfor- 
mance of classifiers trained on subsets of all our data and the 
weights we learned for feature sets and individual features provide 
some guidance for future experiments. Evolutionary conservation 
and DNA sequence patterns are broadly useful in the identifica- 
tion of enhancers, but our results suggest that adding functional 
genomics data is necessary to make more precise predictions about 
the contexts of activity. H3K4me1 and p300 are two of the most
useful functional genomics data types overall (Figure S6), but many 
others are useful in particular contexts. However, the non-random 
sampling of functional genomics data and enhancers makes 
definitively determining the relative utility of different data types 
challenging. 

Why are heart enhancers easier to predict than other 
types of enhancers? 

We saw a broad range in our ability to predict the tissue 
specificity of enhancers from existing data. Heart enhancers were 
dramatically easier to identify than other tissue-specific enhancers. 
Heart enhancers have significantly higher GC content than 
enhancers of other tissues, are less evolutionarily conserved, and 
are closer to the nearest TSS than other known enhancers at 
E11.5, and we show that GC content alone is sufficient to
accurately predict many heart enhancers (Figures S7 and S8). 
However, functional genomics data alone were also able to 
accurately predict heart enhancers. The underlying biological 
explanation for these patterns may have to do with relative 
developmental age of different organs and tissues. At E11.5, the
heart is further along its developmental trajectory than the other 
tissues considered, and heart enhancers have completed their most 
conserved developmental stage, whereas forebrain enhancers are 
most strongly conserved at E11.5 and E14.5 [75]. At E11.5, many
of the less conserved, mammal-specific features of the heart are 
developing [91,92], whereas other tissues are still developing under 
more general, less species-specific conserved regulatory programs 
at E11.5 [93]. A recent study of enhancers in the adult mouse
retina found that high local GC content was strongly correlated 
with enhancer activity [94]. Paired with our result, this suggests 
that GC content is a distinguishing feature of certain classes of 
enhancers. 

Limitations of our approach 

In spite of the strong overall performance of EnhancerFinder at 
predicting tissue-specific developmental enhancers, our approach
has some limitations. First, we rely heavily upon the VISTA
Enhancer Browser for training examples, because it is the largest 
collection of validated mammalian enhancers currently available. 
This resource provides an impressive catalog of validated human 
regulatory enhancers, but it is limited to a single developmental 
stage and experimental system. Without more data and analysis, it 
is difficult to evaluate how specific our predictions are to this 
context. Applying EnhancerFinder to known enhancers in model 
organisms, such as zebrafish and fly, would provide additional 
opportunities to evaluate our approach and findings, while 
potentially demonstrating differences in how enhancers function 
in these different species. 

Second, most of the enhancers present in VISTA are 
evolutionarily conserved. As a result, the VISTA enhancers 
cannot be viewed as an exhaustive catalog of the full range of 
enhancers. However, these regions have validated enhancer 
activity in vivo, and thus provide an appealing alternative to 
approaches that use single-mark proxies for enhancer activity (e.g., 
considering all H3K27ac peaks as active enhancer regions). In 
addition to being conserved, these regions contain many signatures 
of enhancers in their sequence motifs and functional genomics 
composition that are useful for predicting enhancers. To 
emphasize these features and mitigate the impact of bias towards 
conserved regions, we removed evolutionary conservation as a 
feature from EnhancerFinder when we applied it to predict 
enhancers genome-wide. Our goal in doing so was to improve our 
ability to discern less conserved enhancers in these genome-wide 
predictions, and indeed, we predicted thousands of non-conserved 
enhancers (~20% of all predictions). 

Third, though our predictions are based on a large collection of 
genome-wide chromatin state, protein-binding, and sequence 
information from many contexts, we are still limited by data 
availability. Even with the impressive efforts of ENCODE and 
related projects, producing data that are perfectly matched to all 
contexts of interest is time consuming and sometimes impossible, 
especially when studying humans. Thus, it will be important to 
develop a principled understanding of how different data can be 
generalized across tissues, developmental stages, and between 
species. In our analysis, many of the highest weighted features 
come from contexts close to the developmental stage of interest, 
and thus we anticipate that gathering more data from develop- 
mentally relevant cells and tissues will significantly improve our 
ability to annotate genomic regions involved in the regulation of 
embryonic development. However, data from other, seemingly 
unrelated, contexts may continue to prove useful. 

Extensions and future applications 

This study annotates regulatory elements in the human 
genome and provides tools for interpreting the effects of 
mutations in non-coding regions. Our case studies on regions 
around ZEB2, FOXC1, and FOXC2 illustrate how our
predictions can facilitate the rapid identification of novel 
enhancers. In addition, the statistical enrichment for GWAS 
SNPs in our genome-wide enhancer predictions suggests that 
they may be a good resource for pinpointing causal mutations 
in potential disease loci. 

EnhancerFinder is a general framework for enhancer prediction 
and evaluation of different data sources that aim to annotate the 
regulatory functions of the human genome. It could easily be 
extended to include additional types of data, such as population- 
level variation at each locus, information about the three- 
dimensional state of the genome from Hi-C and 5C, and 
predictions of potential target genes for each enhancer. It could 
also be used to analyze additional aspects of the data we already
consider, such as accounting for the relative genomic position of
different features [66]. 

The EnhancerFinder two-step approach enables delineation of 
features common to all enhancers versus those that characterize 
enhancers of different types. For example, we find that predicting 
enhancers that are unique to a single tissue is more difficult than 
those that are active in multiple tissues (Figure S9), that certain 
features make prediction of heart enhancers particularly easy, and 
that different features are selected in classifiers for general 
enhancers and those for specific tissues. Together, these results 
suggest that there may be distinct classes of enhancers, even 
among those active in a given tissue at a single developmental 
stage. Further analysis of EnhancerFinder classifiers based on 
different types of data may help suggest biological mechanisms 
underlying the functional distinctions and genomic features of 
these different classes of enhancers. 

Methods 

Ethics statement 

Transgenic mice were generated by Cyagen Biosciences 
(http://www.cyagen.com/). Their facility meets and often 
exceeds animal health and welfare guidelines. Animals were 
euthanized using techniques recommended by the American 
Veterinary Medical Association. All procedures were carried 
out in line with Gladstone Institutes and University of 
California guidelines. All zebrafish work was approved by 
the UCSF Institutional Animal Care and Use Committee 
(protocol number AN 100466). 

Genomic data 

All work presented in this paper is based on the February 2009
assembly of the human genome (GRCh37/hg19) downloaded
from the UCSC Genome Browser (http://genome.ucsc.edu/).
Any data that were not in reference to this build were mapped over
using the liftOver tool from the UCSC Kent tools
(http://hgdownload.cse.ucsc.edu/admin/jksrc.zip).

Multiple kernel learning-based prediction of 
developmental enhancers 

In our framework, genomic regions are associated with a 
common set of descriptive features. We then apply machine- 
learning algorithms that use the features of known training 
examples to learn a function of the feature data that distinguishes 
the positives (enhancers) from the negatives (non-enhancers). This 
function can then be applied to the features associated with 
uncharacterized genomic regions to predict their enhancer status. 
A positive score for a genomic region indicates predicted 
membership in the positive class (enhancers) and a negative score 
indicates predicted membership in the negative class (non- 
enhancers). 

Training examples. We obtained all of our positive training 
data and our tissue-specific negative training data from the VISTA 
Enhancer Browser [69] on April 4, 2012. We downloaded the 
location, DNA sequence, and expression contexts for all human 
sequences tested in the VISTA mouse E11.5 enhancer screen.
This consisted of 711 validated human enhancers and 736 
genomic regions that did not exhibit enhancer activity in this 
context (http://enhancer.lbl.gov/). The median length of the 
enhancers in VISTA is 1,545 bp. 

In the first step of EnhancerFinder (Figure 1), we used all 711 
VISTA enhancers as positive training data. For negative training 
data, we generated a set of 711 random genomic regions matched
to the length and chromosome distribution of the positives, and
filtered to remove known VISTA enhancers and assembly gaps. 
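
The construction of the Step 1 negative set can be illustrated with a minimal sketch. This is not the code used in the study: the function names, the rejection-sampling strategy, and the `excluded` dictionary (standing in for VISTA enhancers and assembly gaps) are assumptions made for illustration.

```python
import random

def overlaps(start, end, intervals):
    """Return True if the half-open interval [start, end) overlaps any (s, e) in intervals."""
    return any(start < e and s < end for s, e in intervals)

def sample_matched_negatives(positives, chrom_sizes, excluded, seed=0, max_tries=10000):
    """Draw one random region per positive with the same chromosome and length.

    positives   -- list of (chrom, start, end) tuples
    chrom_sizes -- dict mapping chrom -> chromosome length
    excluded    -- dict mapping chrom -> list of (start, end) intervals to avoid
                   (hypothetical stand-in for known enhancers and assembly gaps)
    """
    rng = random.Random(seed)
    negatives = []
    for chrom, start, end in positives:
        length = end - start
        blocked = excluded.get(chrom, [])
        for _ in range(max_tries):
            s = rng.randint(0, chrom_sizes[chrom] - length)
            if not overlaps(s, s + length, blocked):
                negatives.append((chrom, s, s + length))
                break
        else:
            # No acceptable placement was found within max_tries draws.
            raise RuntimeError(f"no valid region found on {chrom}")
    return negatives
```

Matching per positive, as done here, reproduces the length and chromosome distribution of the positive set exactly; other matching schemes would also satisfy the description above.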

In the second step of EnhancerFinder, we used tissue-specific 
subsets of the 1,447 VISTA regions for training. For example, 
when predicting heart enhancers, our positive training data were 
the 84 VISTA regions with heart expression in E11.5 mice, and
our negative training data were the remaining 1,363 VISTA
regions that were tested and showed no heart expression at E11.5,
even though they may be enhancers in other tissues or not enhancers at all.
We did not require that a region be active only in the tissue of 
interest. We included the VISTA negatives in this analysis, 
because they share many attributes in common with known 
enhancers and may have enhancer activity in contexts other than 
E11.5. Our results did not change dramatically when the VISTA
negatives were not included in the training. We trained tissue- 
specific classifiers for the six tissues with more than 50 examples in 
VISTA: forebrain, midbrain, hindbrain, heart, limb, and neural 
tube. We also trained a brain enhancer classifier on the combined
forebrain, midbrain, and hindbrain enhancers.
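
The Step 2 labeling scheme can be sketched as follows. The record layout (a dict with a `tissues` set) and the function name are hypothetical; the point is only that every tested region without activity in the tissue of interest, including enhancers of other tissues and VISTA negatives, is labeled negative.

```python
def tissue_labels(regions, tissue):
    """Label each tested region +1 if active in `tissue`, otherwise -1.

    regions -- list of dicts with a 'tissues' set of annotated expression
               contexts (an empty set represents a VISTA negative)
    """
    return [1 if tissue in region["tissues"] else -1 for region in regions]

# Example: a heart+limb enhancer, a forebrain-only enhancer, and a VISTA negative.
example_regions = [
    {"id": "a", "tissues": {"heart", "limb"}},
    {"id": "b", "tissues": {"forebrain"}},
    {"id": "c", "tissues": set()},
]
heart_labels = tissue_labels(example_regions, "heart")
```

Because activity in other tissues is not checked, a region active in both heart and limb is a positive for both the heart and the limb classifier.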

Feature data. We considered three main types of data as 
features in our analysis: functional genomics data, evolutionary 
conservation, and DNA sequence motifs. We obtained our 
functional genomics feature data from the ENCODE data
repository at the UCSC Genome Browser
(http://genome.ucsc.edu/ENCODE/ and [95]). These data include histone
modifications, such as H3K4me1, H3K4me3, and H3K27ac; protein-DNA
associations for many TFs and p300; and several measurements of
open chromatin (DNaseI hypersensitivity, FAIRE, digital genomic
footprinting), from hundreds of cell types [95]. We also included
heart p300 data from [39]. For a full list of the functional genomics
data considered, see Table S1. We associated each genomic region
with a binary vector that represents the presence or absence of
overlap with each functional genomics data set. To determine this
feature vector, we intersected the genomic location of the region of
interest with the peaks defined by the original researchers (from
the broadPeak or narrowPeak files) using intersectBed [96]. We
found that considering non-binary functional genomics features
based on experimental data, like the density of sequence reads
from a ChIP-seq study, did not significantly improve performance
(data not shown). However, we suspect that with consistent peak 
calling and appropriate normalization this might be an avenue for 
future improvement. 
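
As a rough illustration of the binary overlap encoding, the sketch below builds one 0/1 entry per data set. It is a pure-Python stand-in for the interval intersection step, not the intersectBed invocation itself; the function names and the in-memory peak lists are assumptions.

```python
def overlaps_any(region, peaks):
    """Return True if region (chrom, start, end) overlaps any peak on the same chromosome."""
    chrom, start, end = region
    return any(c == chrom and start < e and s < end for c, s, e in peaks)

def binary_feature_vector(region, datasets):
    """Build a 0/1 vector with one entry per data set, in the dict's insertion order.

    datasets -- dict mapping a data set name to its list of (chrom, start, end) peaks
    """
    return [1 if overlaps_any(region, peaks) else 0 for peaks in datasets.values()]

# Example with two hypothetical peak sets.
example_datasets = {
    "H3K4me1_H1": [("chr1", 900, 1200), ("chr2", 10, 50)],
    "p300_heart": [("chr1", 5000, 5400)],
}
example_vector = binary_feature_vector(("chr1", 1000, 2500), example_datasets)
```

Intervals are treated as half-open, following the BED convention, so a peak ending exactly where a region starts does not count as overlap.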

To summarize the DNA sequence motif patterns in a genomic 
region, we calculated the number of occurrences of all possible 4- 
mers in the sequence. 
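
A minimal sketch of this 4-mer summary is given below, together with the dot product of two count vectors, which is the k-spectrum kernel value for a pair of sequences. The function names are hypothetical, and the handling of ambiguous bases (k-mers containing characters other than A, C, G, T are ignored) and of strand (only the given strand is counted) are assumptions not specified in the text.

```python
from collections import Counter
from itertools import product

def kmer_counts(seq, k=4):
    """Count occurrences of every possible k-mer over ACGT, in lexicographic order."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    # k-mers containing other characters (e.g. N) are never looked up, so they are ignored.
    return [counts.get("".join(p), 0) for p in product("ACGT", repeat=k)]

def spectrum_kernel(seq_a, seq_b, k=4):
    """Dot product of the two k-mer count vectors (the k-spectrum kernel)."""
    return sum(a * b for a, b in zip(kmer_counts(seq_a, k), kmer_counts(seq_b, k)))
```

For k = 4 each region is summarized by a vector of 256 counts, regardless of its length.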

Evolutionary conservation estimates were taken from the 
mammalian phastCons elements [72] obtained from the phastCon- 
sElements46wayPlacental track in UCSC Genome Browser. Each 
genomic region was assigned its maximum overlapping phastCons 
score or zero if it did not overlap any phastCons elements. 
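
The conservation feature can be sketched in a few lines. The function name and the in-memory element list are hypothetical; in practice the elements would be read from the phastCons track.

```python
def max_conservation_score(region, elements):
    """Return the largest score among elements overlapping the region, or 0 if none overlap.

    region   -- (chrom, start, end)
    elements -- list of (chrom, start, end, score) conserved elements
    """
    chrom, start, end = region
    scores = [score for c, s, e, score in elements
              if c == chrom and start < e and s < end]
    return max(scores, default=0)
```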

Machine-learning algorithms. EnhancerFinder is an ex- 
tension of the SVM supervised learning framework that allows the 
integration of multiple data types into a single discrimination 
function. Standard 1-norm MKL augments the usual SVM 
discrimination function, $f$, with additional parameters, $\beta_j$, that
weight the contribution of each kernel function $k_j$:

$$f(x) = \sum_{i=1}^{N} \alpha_i \sum_{j=1}^{M} \beta_j k_j(x, x_i) + b$$

where $N$ is the number of training examples, $M$ is the number of
kernels, $\alpha_i$ are the training example weights, and $b$ is the bias [66].
We include three kernel functions in EnhancerFinder, each of
which corresponds to one of the three types of feature data
described above. These kernels quantify the similarity of the 
features of the appropriate type for any two genomic regions. To 
combine the kernels, the MKL algorithm simultaneously learns 
weights for the associated kernels, in addition to learning the bias 
and weights for each training example as in a standard SVM. We 
use the 4-spectrum kernel [71] for our sequence features; this 
kernel has been shown to perform well in a variety of DNA 
sequence-based prediction tasks including enhancer prediction 
[54]. For the functional genomics and evolutionary conservation 
data, we use linear kernels, which are equivalent to dot products of 
the feature vectors. We explored the use of alternative, non-linear 
kernels for these features and found that they performed similarly 
(data not shown). Each kernel was variance normalized, and we 
balanced the misclassification costs by class size [97]. In addition 
to EnhancerFinder classifiers, we also trained and evaluated the 
constituent single kernel SVMs. All analyses were performed using 
the implementation of SVMs and MKL in the SHOGUN 
Machine Learning Toolbox v1.1.0 [98].
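
To make the role of the kernel weights concrete, the sketch below evaluates an MKL-style score for one region: the bias plus, for each training example, its weight times the weighted sum of kernel values. It is an illustration only and does not use the SHOGUN API; the function names are hypothetical, the weights are assumed to be already learned, and each training example weight is assumed to be signed (that is, to absorb the class label).

```python
def linear_kernel(u, v):
    """Dot product of two equal-length feature vectors."""
    return sum(a * b for a, b in zip(u, v))

def mkl_score(x, train, alphas, betas, kernels, bias):
    """Score a region under a fixed multiple-kernel model.

    x       -- tuple with one feature representation per kernel
    train   -- list of training examples, each a tuple laid out like x
    alphas  -- signed training example weights (assumed to absorb the labels)
    betas   -- one non-negative weight per kernel
    kernels -- one kernel function per feature type
    bias    -- scalar offset
    """
    total = bias
    for alpha, x_i in zip(alphas, train):
        # Combine the per-feature-type similarities using the kernel weights.
        combined = sum(beta * kernel(x[j], x_i[j])
                       for j, (beta, kernel) in enumerate(zip(betas, kernels)))
        total += alpha * combined
    return total
```

A positive return value corresponds to a predicted enhancer and a negative value to a predicted non-enhancer, matching the score convention described at the start of this section.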

Performance evaluations 

To evaluate the performance of trained classifiers, we 
performed 10-fold cross-validation on the training data and 
quantified our results with ROC AUC, precision-recall curves,
and power estimates at fixed false positive rates. We computed p- 
values for the difference in performance between classification 
methods using McNemar's test [99,100]. To estimate false 
discovery rates, we trained EnhancerFinder classifiers at 1:1, 
1:10, and 1:100 ratios of positive to negative examples and used
the resulting 10-fold cross-validation results to calculate the
proportion of false discoveries genome-wide at a 5% FPR if the
true proportion of 1.5 kb windows containing an enhancer was 
50%, 10%, or 1%. 
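
The dependence of the false discovery rate on enhancer prevalence can be illustrated with a short calculation. This sketch uses the standard relation between FPR, power (true positive rate), and the prior proportion of positives; it is not necessarily the exact procedure used above, and the function name and the power value of 0.8 in the example are hypothetical.

```python
def expected_fdr(fpr, tpr, prior):
    """Expected fraction of predictions that are false, given the prevalence of positives.

    fpr   -- false positive rate of the classifier at the chosen score cutoff
    tpr   -- true positive rate (power) at the same cutoff
    prior -- assumed proportion of windows that truly contain an enhancer
    """
    false_pos = fpr * (1.0 - prior)
    true_pos = tpr * prior
    return false_pos / (false_pos + true_pos)

# Same cutoff, three assumed prevalences (the power value is a placeholder).
example_fdrs = [expected_fdr(0.05, 0.8, p) for p in (0.5, 0.1, 0.01)]
```

Holding the FPR fixed at 5%, the expected FDR rises sharply as true enhancers become rarer, which is why the assumed prevalence matters for genome-wide predictions.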

Comparison to existing enhancer prediction methods 

We compared EnhancerFinder's predictions to those of several 
previous enhancer prediction methods. We obtained the perfor- 
mance of CLARE on our Step 1 prediction task, by inputting our 
positive and negative data into the CLARE web server [73]. We 
downloaded the genomic segmentations and annotations produced
by ChromHMM [64] and Segway [65]. We considered the
ChromHMM predictions based on different ENCODE cell lines 
both individually and together. Any genomic region in our 
evaluation data set that overlapped an enhancer state was 
considered a predicted enhancer, and all others were considered 
predicted non-enhancers. For Segway, we also considered the "TF 
activity" state. 

Identification of tissue-specific enhancers across the 
human genome 

We predicted tissue-specific developmental enhancers through- 
out the human genome by applying a trained MKL classifier (Step 
1 of EnhancerFinder) without conservation (see Results) to sliding 
windows of 1500 bp, moving along the human genome in 500 bp 
steps. The feature profile for each window was computed as 
described above. To focus on high-confidence predictions, we 
filtered the enhancer scores for the windows at a 5% FPR, 
estimated from cross-validation using the genomic background, 
and combined the remaining overlapping windows to produce 
84,301 high-confidence predicted enhancers. 
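
The windowing and merging steps can be sketched as follows. The function names are hypothetical, the score cutoff is passed in as a parameter (standing in for the cutoff at a 5% FPR), and the treatment of windows that merely touch without overlapping is an assumption.

```python
def sliding_windows(chrom_length, size=1500, step=500):
    """Return half-open (start, end) windows of `size` bp tiled every `step` bp."""
    return [(s, s + size) for s in range(0, chrom_length - size + 1, step)]

def merge_passing_windows(windows, scores, threshold):
    """Keep windows scoring at or above threshold and merge those that overlap.

    Windows are assumed to be sorted by start position. Windows that only
    touch (one ends exactly where the next starts) are kept separate.
    """
    merged = []
    for (start, end), score in zip(windows, scores):
        if score < threshold:
            continue
        if merged and start < merged[-1][1]:
            # Extend the previous region to cover this overlapping window.
            merged[-1] = (merged[-1][0], max(merged[-1][1], end))
        else:
            merged.append((start, end))
    return merged
```

With 1500 bp windows and 500 bp steps, each base away from chromosome ends is covered by three windows, so a single strong signal typically yields a merged region longer than one window.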

To predict tissue specificity, we applied trained brain, limb, and 
heart classifiers (Step 2 of EnhancerFinder) without conservation 
to all 299,039 windows with positive enhancer scores in Step 1. We
then applied a 5% FPR cutoff for each tissue and concatenated the
remaining overlapping windows into merged enhancer regions.
Using this approach, we predicted 19,051 heart enhancers, 11,693
brain enhancers, and 7,400 limb enhancers. 

Analysis of genome-wide tissue-specific enhancer 
predictions 

We characterized the expression patterns of the gene nearest to 
each predicted enhancer using the GNF Atlas 2 [101]. It contains
expression data for genes in 79 different tissues, with expression 
measured using Affymetrix microarrays. For each of these 79 
tissues, we used a paired t-test to determine if the nearest genes of 
predicted heart enhancers had significantly different mean values 
of expression than the nearest genes of brain enhancers. We did 
not include the limb enhancers in this analysis due to the lack of 
relevant expression data in the GNF Atlas 2.

We examined genomic regions near predicted developmental 
enhancers for enrichment of Gene Ontology functional annota- 
tions, known phenotypes, and pathways using GREAT [102]. 
Results were computed using the hypergeometric test for genome- 
wide significance, with the default settings and the "basal plus 
extension" association rule (proximal 5 kb upstream, 1 kb 
downstream, plus distal up to 100 kb). 

We identified the sequence motifs present in each set of 
enhancers using the FIMO tool (Find Individual Motif Occur- 
rences) from the MEME Suite of sequence motif analysis tools 
[103]. We considered known transcription factor binding motifs 
from the April 2011 release of the TRANSFAC database with a
FIMO score threshold of 10e-5. We identified those occurrences 
that fell in predicted enhancers, and summarized motifs to identify 
the most prevalent TFs in each tissue-specific set of enhancers. 

We analyzed the overlap of predicted enhancers with GWAS 
SNPs, based on the NHGRI catalog of 9,687 GWAS SNPs 
downloaded from the UCSC Genome Browser in October 2012. 
Unadjusted permutation p-values were calculated by randomizing 
genomic locations of predicted enhancers (matching for length and 
chromosome, and avoiding assembly gaps) and overlapping these 
randomized regions with GWAS SNPs to assess significance of 
overlapping regions. 
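
A permutation test of this kind can be sketched as below. This is a simplified illustration rather than the analysis code: the function names are hypothetical, assembly gaps are not modeled, the statistic is assumed to be the number of SNPs falling inside regions, and the add-one correction in the p-value is a common convention that may differ from what was used here.

```python
import bisect
import random

def count_snps_in_regions(regions, snps):
    """Count SNP positions falling inside half-open regions.

    regions -- list of (chrom, start, end)
    snps    -- dict mapping chrom -> sorted list of SNP positions
    """
    total = 0
    for chrom, start, end in regions:
        positions = snps.get(chrom, [])
        total += bisect.bisect_left(positions, end) - bisect.bisect_left(positions, start)
    return total

def permutation_p_value(regions, snps, chrom_sizes, n_perm=1000, seed=0):
    """Estimate how often length- and chromosome-matched random regions
    contain at least as many SNPs as the observed regions."""
    rng = random.Random(seed)
    observed = count_snps_in_regions(regions, snps)
    hits = 0
    for _ in range(n_perm):
        shuffled = []
        for chrom, start, end in regions:
            length = end - start
            s = rng.randint(0, chrom_sizes[chrom] - length)
            shuffled.append((chrom, s, s + length))
        if count_snps_in_regions(shuffled, snps) >= observed:
            hits += 1
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_perm + 1)
```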

Transgenic enhancer assays 

Mouse enhancer assays were carried out in transient transgenic 
mouse embryos generated by pronuclear injections of enhancer 
assay constructs into FVB embryos (Cyagen Biosciences). Human 
and chimpanzee DNA sequences were inserted upstream of an
Hsp68 minimal promoter and a LacZ reporter gene. The human
sequence was amplified using primers
5'-TGTATGAAACCTGTTCACTCTCC-3' and
5'-GCTTAAAACAACTACTAGAATCAGGC-3' from the bacterial artificial
chromosome (BAC) RP11-107E5 (from the BacPac resource at CHORI).
The chimpanzee sequence was amplified using primers
5'-TGTATGAAACGTGTTCACTCTCC-3' and
5'-GCTTAAAACAACTACTAGAATCAGGC-3' from BAC CH251-677E03a
(CHORI). The embryos were collected and stained for LacZ
expression at E11.5.

Following the annotation policies of the VISTA Enhancer 
Browser, we required that consistent spatial expression patterns be 
present in three or more embryos with staining in order for the 
region to be considered an enhancer. 

Zebrafish enhancer assays were performed in transient trans- 
genic zebrafish embryos. We tested candidate enhancer regions 
that ranged in length from 987 bp to 3,633 bp (see Table S6 for 
hg19 genomic coordinates), which we manually demarcated from
within larger predicted enhancer regions based on signatures of
likely enhancer function (including DNaseI hypersensitivity sites,
transcription factor binding sites, histone modifications, and 
conservation). 

We performed PCR to obtain the candidate enhancer sequence 
using human genomic DNA (Roche). These were cloned into the 
E1b-GFP-Tol2 enhancer assay vector containing an E1b minimal
promoter followed by GFP [104], and the construct was verified 
by sequencing. Each construct was injected with Tol2 mRNA into 
at least 100 single-cell fertilized zebrafish embryos. We annotated 
GFP expression at approximately 24 and 48 hours post fertilization
(hpf), and considered an enhancer to be positive if we
observed consistent expression in at least 15% of all fish alive at 
either 24 or 48 hpf [105], and suggestive of enhancer activity if we 
observed consistent expression in at least 10% of all fish alive at 24 
or 48 hpf, after subtracting out percentages of tissue expression in 
fish injected with the empty enhancer vector. For each construct, 
at least 50 fish were analyzed for GFP expression at 48 hpf. 

Supporting Information 

Figure S1 Precision-Recall curves corresponding to all
ROC curves presented in the main text. (A) Figure 2A (B) 
Figure 3A (C) Figure 3B (D) Figure 4. A PR curve could not be 
created for Figure 2C, because we could not obtain the raw scores 
for regions from the CLARE web server. 
(PDF) 

Figure S2 VISTA enhancers overlap many common 
marks of enhancers, but no common mark is universal 
to all VISTA enhancers. We computed the overlap between 
711 VISTA enhancers and three common functional genomic
marks of enhancers and found that 450 enhancers overlap
H3K27ac (in any of 16 datasets from ENCODE), 563 overlap
H3K4me1 (in any of 15 datasets from ENCODE), and 404
overlap p300/CBP (in any of 35 datasets from ENCODE and 
human tissues). Fewer than half of the enhancers (306) overlap all 
three common marks of enhancers, and 93 do not overlap any of 
those three functional genomics marks. All but five of the VISTA 
enhancers overlap a conservation peak (phastCons 46-way 
placental mammal). Four of these non-conserved enhancers 
overlap all three functional genomics marks, and one
non-conserved enhancer overlaps just H3K27ac and H3K4me1.
(PDF) 

Figure S3 The 4-spectrum kernel performs competi- 
tively with other k-spectrum kernels and the combina- 
tion of k-spectrum kernels. We analyzed the ability of 
spectrum kernels based on k-mer lengths between 2 and 8 to 
distinguish enhancers from the genomic background (Step 1). K- 
mers between 4 and 7 had the best performance. We also 
evaluated an MKL algorithm that combined each k-spectrum 
kernel, and it did not provide significant improvement over the 
best individual kernels. 
(PDF) 

Figure S4 Considering known TFBS motifs does not 
improve the 4-spectrum kernel. Considering the number of 
occurrences of known TFBS motifs as features has recently been 
used in a linear SVM framework to predict enhancers [52]. To 
evaluate the utility of this approach, instead of and in addition to 
considering all k-mers, we created a linear SVM that used the 
number of hits to 1022 TF binding site matrices from 
TRANSFAC and JASPAR as computed by FIMO as features. 
That is, the feature vector for each region consisted of 1022
elements, each of which was the number of significant hits for a 
different TF motif. This TFBS linear SVM (AUC = 0.81) did not
perform as well as the 4-spectrum kernel (AUC = 0.88). We also
evaluated an MKL algorithm that combined the 4-spectrum and 
TFBS kernels. This combined kernel did not perform any better 
than the 4-spectrum kernel suggesting that, at least under this 
encoding, TFBS motifs do not provide significant additional 
benefit in distinguishing enhancers from the genomic back- 
ground. 
(PDF) 

Figure S5 Combining functional genomics data with an 
SVM outperforms simply considering regions overlap- 
ping these data. The four solid lines shown are the same as in 
Figure 3B; they summarize the performance of these methods at 
distinguishing VISTA enhancers from the genomic background 
(Step 1). The X's give the performance of approaches that consider 
all regions overlapping a given feature as positives and all others as 
negatives. The + and * indicate the performance obtained by 
considering the union and intersection of H3K4mel, p300, and 
H3K27ac, respectively. For each feature, the linear SVM achieves 
better performance than simply considering all overlapping 
regions as positives. 
(PDF) 

Figure S6 EnhancerFinder feature weights highlight the 
contribution of different functional genomics data types 
to enhancer predictions. Each "+" represents the contribution 
made by a single data feature, e.g. H3K4me1 peaks from
embryonic stem cells, to the classification in EnhancerFinder Step 
1 (developmental enhancers versus genomic background). Positive 
weights (red) indicate an association with enhancer activity in our 
analysis and negative weights (blue) suggest a lack of enhancer 
activity. The features plotted here come from a range of likely 
relevant contexts (Relevant Functional Genomics classifier; 
Table S1), and the number of data sets present for each feature
type is given in parentheses. The black bar gives the average 
weight over all features of each type. In general, the features with 
high average weights, such as H3K4me1, p300, and H3K4me2,
are known to be associated with enhancers, while those with large 
negative weights are associated with other types of genomic 
regions. However, no data type has uniformly positive or negative 
weights in all contexts. 
(PDF) 

Figure S7 Heart enhancers are less conserved and 
closer to the nearest transcription start site (TSS) than 
limb and brain enhancers. Considering only limb and brain 
enhancers that are less evolutionarily conserved and close to a TSS 
improved our ability to identify them, but they are still more 
difficult to identify than heart enhancers. In addition to these 
features, heart enhancers have uniquely high GC content 
compared to other enhancers and the genomic background 
(Figure S8).
(PDF) 

Figure S8 The uniquely high GC content of heart 
enhancers in VISTA enables accurate classification. 

The VISTA heart enhancers have higher GC content (49%) than 
other types of enhancers and the genomic background (~40%). 
(A) The classification score from a spectrum kernel classifier 
trained to distinguish heart enhancers within VISTA (Step 2) is 
strongly correlated (Pearson rho = 0.95) with the GC content of 
the input region. (B) A classification algorithm based solely on GC 
content (black) performs competitively with the spectrum kernel 
(AUC of 0.80 vs. 0.82), and nearly as well as EnhancerFinder 
(0.85; Figure 4). 
(PDF) 



Figure S9 Enhancers active in multiple tissues are 
easier to identify than those active in a single tissue. 

There are 399 enhancers active in a single tissue at E11.5 in the
VISTA database and 312 active in multiple tissues. EnhancerFin- 
der is better able to distinguish the enhancers active in multiple 
tissues from the VISTA negatives (AUC = 0.75) than it is to 
distinguish single tissue enhancers from the negatives 
(AUC = 0.67). This trend also holds across each tissue individually. 
However, both sets are easy to distinguish from the genomic 
background (AUC = 0.96 for both, not shown). 
(PDF) 

Figure S10 Three novel developmental enhancers near 
FOXC1. This UCSC Genome Browser screenshot shows six 
candidate enhancer regions tested in transgenic zebrafish. Three 
of the regions showed positive or suggestive expression at 24 or 
48 hpf. F1EC-1 drives expression at 48 hpf; the arrows highlight 
reproducible midbrain, spinal cord, and epidermis expression. 
F1EC-3 shows suggestive expression at 24 hpf in somitic muscles 
and the epidermis (arrows). F1EC-6 drives expression at 48 hpf in 
the pericardium and heart (suggestive). The other three tested 
candidate enhancers without corresponding zebrafish images were 
negative in the enhancer assay. See Table S6 for full list of 
expressed tissues seen in each candidate enhancer. 
(PDF) 

Figure S11 Transient transgenic mouse embryos sup-
port a novel cranial nerve enhancer near ZEB2. Seven
transient transgenic mouse embryos showed LacZ expression at
embryonic day 11.5. Constructs containing a 999 bp region
(hg19.chr2:145,234,541-145,235,539) including 2xHAR.240
near ZEB2, a minimal promoter, and LacZ were used for
human. The orthologous region was used in the chimp construct
(panTro2.chr2b:148,811,929-148,812,929). Three embryos
with constructs containing the human version of the region of 
interest and four embryos containing the chimp sequence had 
staining. In all embryos, there was consistent expression in the 
cranial nerve. There does not appear to be a significant 
difference in the activity driven by the human and chimp 
sequences at this time point. 
(PDF) 

Table S1 Functional genomics features used in our
analysis. This Excel spreadsheet lists the files used from 
ENCODE (http://genome.ucsc.edu/ENCODE/) or GEO 
(http://www.ncbi.nlm.nih.gov/geo/). There is a sheet for each 
of the classifiers based on functional genomics data that lists all 
data files used. ENCODE data set names are UCSC track names. 
GEO data set names are GEO identifiers. 
(XLS) 

Table S2 Genes near brain enhancers have significant- 
ly higher gene expression in brain and neural tissues 
than genes near heart enhancers. Brain- or heart-related 
tissues with significantly higher mean expression in genes 
associated with predicted brain enhancers compared to predicted 
heart enhancers. 
(DOC) 

Table S3 Genes near heart enhancers have significant- 
ly higher gene expression in cardiac-related tissues 
than genes near brain enhancers. Brain- or heart-related 
tissues with significantly higher mean expression in genes 
associated with predicted heart enhancers compared to predicted 
brain enhancers. 
(DOC)

Table S4 The top 25 transcription factors for which
binding sites were most prevalent in brain, heart, and 
limb enhancers. 

(DOC) 

Table S5 676 GWAS SNPs are found in predicted 
enhancers. This Excel spreadsheet lists all GWAS SNPs from the 
NHGRI database that fall within one of our predicted enhancers. 
(XLSX) 

Table S6 Candidate enhancer regions tested in zebra- 
fish. We tested 10 candidate enhancer regions in a transgenic 
zebrafish assay. This table lists the genomic coordinates (hg19) and
expression patterns observed for each construct at 24 and 48 hpf. A
representative fish is shown for each positive enhancer in (Figures 7
and S9). Candidate enhancers on chromosome 6 are near FOXC1,
and those on chromosome 16 are near FOXC2. N is the 
number of zebrafish alive at the specified time point, and * 
indicates expression patterns that are "suggestive," but below 
the 15% threshold we used for confirmed enhancers. 
(DOC) 

Data File S1 This ZIP archive contains BED files
(hg19 coordinates) with EnhancerFinder's genome-wide